Hi, I need a solution to a problem in the R programming language.
library(gtools)
end_date <- "2021-12-31"
ddf1 <- data.frame(pnr=c("1","1","2","2","3","3","3","4"),
in_date=as.POSIXct(c("2010-08-18","2010-09-01","2019-04-02","2018-03-27",
"2019-07-12","2013-10-20","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04",NA,"2019-05-17",NA,
NA,"2017-08-19",NA,NA)),
treat1=c(1,1,1,1,1,1,1,1)
)
ddf2 <- data.frame(pnr=c("4","4","3","3","2","2","2","1"),
in_date=as.POSIXct(c("2010-08-18","2010-09-01","2019-04-02","2018-03-27",
"2019-07-12","2013-10-20","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04",NA,"2019-05-17",NA,
NA,"2017-08-19",NA,NA)),
treat2=c(1,1,1,1,1,1,1,1)
)
expected_output1 <- data.frame(pnr=c("1","2","3","3","4"),
in_date=as.POSIXct(c("2010-08-18","2018-03-27",
"2019-07-12","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04","2019-05-17",end_date,
"2017-08-19",end_date)),
treat1=c(1,1,1,1,1)
)
expected_output2 <- data.frame(pnr=c("4","3","2","2","1"),
in_date=as.POSIXct(c("2010-08-18","2018-03-27",
"2019-07-12","2012-07-01","2015-05-02")),
out_date=as.POSIXct(c("2010-12-04","2019-05-17",end_date,
"2017-08-19",end_date)),
treat2=c(1,1,1,1,1)
)
ddf <- smartbind(ddf1,ddf2)
expected_output <- smartbind(expected_output1,expected_output2)
> ddf
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 1 2010-09-01 <NA> 1 NA
1:3 2 2019-04-02 2019-05-17 1 NA
1:4 2 2018-03-27 <NA> 1 NA
1:5 3 2019-07-12 <NA> 1 NA
1:6 3 2013-10-20 2017-08-19 1 NA
1:7 3 2012-07-01 <NA> 1 NA
1:8 4 2015-05-02 <NA> 1 NA
2:1 4 2010-08-18 2010-12-04 NA 1
2:2 4 2010-09-01 <NA> NA 1
2:3 3 2019-04-02 2019-05-17 NA 1
2:4 3 2018-03-27 <NA> NA 1
2:5 2 2019-07-12 <NA> NA 1
2:6 2 2013-10-20 2017-08-19 NA 1
2:7 2 2012-07-01 <NA> NA 1
2:8 1 2015-05-02 <NA> NA 1
> expected_output
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 2 2018-03-27 2019-05-17 1 NA
1:3 3 2019-07-12 2021-12-31 1 NA
1:4 3 2012-07-01 2017-08-19 1 NA
1:5 4 2015-05-02 2021-12-31 1 NA
2:1 4 2010-08-18 2010-12-04 NA 1
2:2 3 2018-03-27 2019-05-17 NA 1
2:3 2 2019-07-12 2021-12-31 NA 1
2:4 2 2012-07-01 2017-08-19 NA 1
2:5 1 2015-05-02 2021-12-31 NA 1
I have some individuals who have gone through different treatments, treat1 and treat2.
I need to deal with the fact that some treatment courses were started but lack an out_date.
When an out_date is missing, it should be replaced with the end_date of the study:
end_date <- "2021-12-31"
However, if an observation like the one
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
1:2 1 2010-09-01 <NA> 1 NA
has an in_date (the start of the treatment) that falls within the period of another treatment for the same person, or "pnr", then the correct output is:
pnr in_date out_date treat1 treat2
1:1 1 2010-08-18 2010-12-04 1 NA
because 2010-08-18 is the earliest in_date.
However, when the row without an out_date has the earlier in_date, that earlier date should be used, which is the case for pnr 2:
pnr in_date out_date treat1 treat2
1:3 2 2019-04-02 2019-05-17 1 NA
1:4 2 2018-03-27 <NA> 1 NA
becomes:
pnr in_date out_date treat1 treat2
1:2 2 2018-03-27 2019-05-17 1 NA
So the whole period of treatment is covered, with the earliest in_date and latest out_date.
In cases where there is no out_date the end_date should be set instead;
so that:
pnr in_date out_date treat1 treat2
1:8 4 2015-05-02 <NA> 1 NA
becomes:
1:5 4 2015-05-02 2021-12-31 1 NA
In the special case where a person has both an earlier (or intersecting) in_date and a later in_date with a missing out_date, the function should be able to handle both, as with pnr 3:
1:5 3 2019-07-12 <NA> 1 NA
1:6 3 2013-10-20 2017-08-19 1 NA
1:7 3 2012-07-01 <NA> 1 NA
should become:
pnr in_date out_date treat1 treat2
1:3 3 2019-07-12 2021-12-31 1 NA
1:4 3 2012-07-01 2017-08-19 1 NA
OPTIONAL: If at all possible, it would be great if the function could handle this separately per treatment, so that each pnr is processed independently within treat1 and treat2, as shown in expected_output.
I have attempted to write some code that flags whether an out_date is NA and computes the differences between dates, but I still can't fathom how to continue:
ddf$end_replaced <- as.integer(ifelse(is.na(ddf$out_date),1,0))
ddf <- data.table(ddf)
ddf <- ddf[order(ddf$treat1,ddf$pnr,ddf$in_date,ddf$out_date),]
ddf[, diffx := difftime(in_date, shift(in_date, fill=in_date[1L]),
units="days"), by=pnr]
Thanks for reading
UPDATE
I ended up solving the issue. It's not pretty, so if anyone has a better solution than this, please let me know.
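For comparison, here is a minimal sketch of one way to read the rules described above, written in Python rather than R. It is an interpretation, not the poster's own solution: it assumes that a row with a missing out_date borrows the first known out_date on or after its in_date within the same pnr and treatment, falls back to the study end date otherwise, and that overlapping periods are then merged. The names collapse_periods and END_DATE are hypothetical.

```python
from collections import defaultdict
from datetime import date

END_DATE = date(2021, 12, 31)  # study end date taken from the question


def collapse_periods(rows, end_date=END_DATE):
    """Collapse treatment periods per (pnr, treatment).

    rows: iterable of (pnr, treatment, in_date, out_date_or_None).
    Assumed rule: a missing out_date is filled with the first known
    out_date on or after that in_date, else with end_date; overlapping
    periods are then merged into one.
    """
    groups = defaultdict(list)
    for pnr, treatment, start, stop in rows:
        groups[(pnr, treatment)].append((start, stop))

    result = []
    for (pnr, treatment), periods in groups.items():
        known_stops = sorted(stop for _, stop in periods if stop is not None)
        filled = []
        for start, stop in periods:
            if stop is None:
                stop = next((s for s in known_stops if s >= start), end_date)
            filled.append((start, stop))

        # Standard interval merge on the filled periods.
        filled.sort()
        merged = [filled[0]]
        for start, stop in filled[1:]:
            last_start, last_stop = merged[-1]
            if start <= last_stop:
                merged[-1] = (last_start, max(last_stop, stop))
            else:
                merged.append((start, stop))
        result.extend((pnr, treatment, s, e) for s, e in merged)
    return result


# The pnr 3 rows from the question, for treat1.
example = [
    ("3", "treat1", date(2019, 7, 12), None),
    ("3", "treat1", date(2013, 10, 20), date(2017, 8, 19)),
    ("3", "treat1", date(2012, 7, 1), None),
]
for row in collapse_periods(example):
    print(row)
```

Grouping on both pnr and treatment is what keeps treat1 and treat2 separate. The rows come back sorted by in_date within each group, so the order differs from expected_output even though the periods are the same.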
I want to fill in missing values for a case with values from cases in a different file. The corresponding cases have the same reference number, variable REF. In the end, there should be only one case per reference number, with no missing values in any variable. I already tried Data -> Merge files -> Add variables -> many to one, but I still end up with multiple cases per reference number or no change at all in the table. I can't figure out how this works.
My two data sets:
REF p1 p2 p3
1 5 NA NA
2 3 NA NA
3 4 NA NA
REF p1 p2 p3
1 NA 3 NA
1 NA NA 1
2 NA 2 NA
2 NA NA 4
3 NA 1 NA
3 NA NA 1
Desired output:
REF p1 p2 p3
1 5 3 1
2 3 2 4
3 4 1 1
What I tried, but did not work:
I suggest you first stack the two files, so that all the data is in one table, then use aggregation to get all the data for each case into one line. I suggest aggregation using the max function under the assumption that for every REF only one value exists in each column, so the aggregation will select this value and leave out the other "competing" missing values.
EDITED to leave only one line per "REF":
add files /file = dataset1 /file = dataset2.
exe.
dataset name gen.
aggregate /outfile=* /break=REF /P1 P2 P3=max(P1 P2 P3).
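The stack-then-aggregate idea is not specific to SPSS. As a rough illustration only, the same logic can be sketched in Python with plain dictionaries. The function name stack_and_aggregate is hypothetical, and missing values are assumed to be represented as None.

```python
def stack_and_aggregate(*tables):
    """Stack several tables and keep the max non-missing value per REF and column.

    Each table is a list of dicts with a "REF" key; None marks a missing value.
    """
    merged = {}
    for table in tables:              # "add files": treat all rows as one stack
        for row in table:
            target = merged.setdefault(row["REF"], {})
            for column, value in row.items():
                if column == "REF" or value is None:
                    continue          # missing values never win the aggregation
                current = target.get(column)
                target[column] = value if current is None else max(current, value)
    return merged


dataset1 = [
    {"REF": 1, "p1": 5, "p2": None, "p3": None},
    {"REF": 2, "p1": 3, "p2": None, "p3": None},
    {"REF": 3, "p1": 4, "p2": None, "p3": None},
]
dataset2 = [
    {"REF": 1, "p1": None, "p2": 3, "p3": None},
    {"REF": 1, "p1": None, "p2": None, "p3": 1},
    {"REF": 2, "p1": None, "p2": 2, "p3": None},
    {"REF": 2, "p1": None, "p2": None, "p3": 4},
    {"REF": 3, "p1": None, "p2": 1, "p3": None},
    {"REF": 3, "p1": None, "p2": None, "p3": 1},
]
for ref, values in stack_and_aggregate(dataset1, dataset2).items():
    print(ref, values)
```

As in the SPSS version, max is only a convenient way to pick the single non-missing value. If a REF had two different values in the same column, the larger one would silently win.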
After data cleaning and aggregation I was left with a data table like this:
df
id d1 v1 d2 v2 d3 v3 d4 v4
1 1-1-2018 1 1-1-2018 1 1-1-2018 1 1-1-2018 1
2 1-1-2018 1 1-2-2018 2 1-2-2018 2 1-2-2018 2
3 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-3-2018 3
4 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-4-2018 4
I am trying to remove any values from a column in the above data frame that are duplicates of earlier columns.
I have already tried:
df$v2[df$v1 == df$v2] <- NA
This removed all of the values from the v2 column.
I want my data frame to look like this at the end:
df
id d1 v1 d2 v2 d3 v3 d4 v4
1 1-1-2018 1 NA NA NA NA NA NA
2 1-1-2018 1 1-2-2018 2 NA NA NA NA
3 1-1-2018 1 1-2-2018 2 1-3-2018 3 NA NA
4 1-1-2018 1 1-2-2018 2 1-3-2018 3 1-4-2018 4
Try df[...condition here...]$column <- NA
Or with data.table:
library(data.table)
dt <- data.table(df)
dt[d1 == d2, v2 := NA]
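To make the intended row-wise logic explicit, here is a small Python sketch, offered as an illustration rather than a translation of either answer. It assumes each row holds numbered date/value pairs (d1/v1, d2/v2, ...) and blanks any pair that repeats an earlier pair in the same row. The function name blank_repeats and the use of None for NA are assumptions.

```python
def blank_repeats(row, n_pairs=4):
    """Return a copy of row with repeated (d_k, v_k) pairs set to None."""
    cleaned = dict(row)
    seen = set()
    for k in range(1, n_pairs + 1):
        pair = (row[f"d{k}"], row[f"v{k}"])
        if pair in seen:
            # Same date and value as an earlier pair: treat as a duplicate.
            cleaned[f"d{k}"] = None
            cleaned[f"v{k}"] = None
        else:
            seen.add(pair)
    return cleaned


# Row with id 2 from the question.
row2 = {
    "id": 2,
    "d1": "1-1-2018", "v1": 1,
    "d2": "1-2-2018", "v2": 2,
    "d3": "1-2-2018", "v3": 2,
    "d4": "1-2-2018", "v4": 2,
}
print(blank_repeats(row2))
```

Comparing each pair against everything seen so far, rather than only against the first column, is what blanks d3/v3 and d4/v4 in this row while keeping d2/v2.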
I am trying to import a txt file which has NA values in some columns, which are in numeric format (double precision)
A B C D E F G H I J
100 0.05 NA 11556135.4 1.22911 NA 5.19 NA 17572151.86 3.45E+08
100 0.25 25 11556135.4 1.32911 NA 5.19 NA 17572151.86 69552000
100 0.09 NA 13405172.5 1.16911 44 5.233 23 47253072.8 5.20E+08
100 0.11 NA 15434493.7 1.18911 NA 5.212 NA 55434589.68 5.25E+08
I am getting an error
ERROR: invalid input syntax for type double precision: "NA"
This is the code that I am using to import the file:
copy bond FROM '~/filtered77k.txt' WITH CSV HEADER DELIMITER AS E'\t'
copy bond FROM '~/filtered77k.txt' WITH CSV HEADER DELIMITER AS E'\t' NULL 'NA' I
I have written a script in Perl that merges files based on a mapping in a third file; the reason I am not using join is because lines won't always match. The code works, but gives an error that doesn't appear to affect output: Use of uninitialized value in join or string at join.pl line 43, <$fh> line 21. As I am relatively new to Perl I have been unable to understand what is causing this error. Any help resolving this error or advice about my code would be greatly appreciated. I have provided example input and output below.
join.pl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Tie::File;
use Scalar::Util qw(looks_like_number);
chomp( my $infile = $ARGV[0] );
chomp( my $infile1 = $ARGV[1] );
chomp( my $infile2 = $ARGV[2] );
chomp( my $outfile = $ARGV[3] );
open my $mapfile, '<', $infile or die "Could not open $infile: $!";
open my $file1, '<', $infile1 or die "Could not open $infile1: $!";
open my $file2, '<', $infile2 or die "Could not open $infile2: $!";
tie my @tieFile1, 'Tie::File', $infile1 or die "Could not open $infile1: $!";
tie my @tieFile2, 'Tie::File', $infile2 or die "Could not open $infile2: $!";
open my $output, '>', $outfile or die "Could not open $outfile: $!";
my %map1;
my %map2;
# This loop will read two input files and populate two hashes
# using the coordinates (field 2) and the current line number
while ( my $line1 = <$file1>, my $line2 = <$file2> ) {
my @row1 = split( "\t", $line1 );
my @row2 = split( "\t", $line2 );
# $. holds the line number
$map1{$row1[1]} = $.;
$map2{$row2[1]} = $.;
}
close($file1);
close($file2);
while ( my $line = <$mapfile> ) {
chomp $line;
my @row = split( "\t", $line );
my $species1 = $row[1];
my $reference1 = $map1{$species1};
my $species2 = $row[3];
my $reference2 = $map2{$species2};
my @nomatch = ("NA", "", "NA", "", "", "", "", "NA", "NA");
# test numeric
if ( looks_like_number($reference1) && looks_like_number($reference2) ) {
# do the do using the maps
print $output join("\t", $tieFile1[$reference1], $tieFile2[$reference2]), "\n";
}
elsif ( looks_like_number($reference1) )
{
print $output join("\t", $tieFile1[$reference1], @nomatch), "\n";
}
elsif ( looks_like_number($reference2) )
{
print $output join("\t", @nomatch, $tieFile2[$reference2]), "\n";
}
}
close($output);
untie @tieFile1;
untie @tieFile2;
input_1:
Scf_3L 12798910 T 0 41 0 0 NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA
input_2:
3L 12559896 T 0 31 0 0 NA NA
3L 12559897 C 0 0 33 0 NA NA
3L 12559898 A 34 0 0 0 NA NA
3L 12559899 G 0 0 0 33 NA NA
3L 12559900 T 0 34 0 0 NA NA
3L 12559901 G 0 0 0 33 NA NA
3L 12559902 T 0 33 0 0 NA NA
3L 12559903 A 33 0 0 0 NA NA
3L 12559904 G 0 0 0 33 NA NA
3L 12559905 T 0 34 0 0 NA NA
3L 12559906 T 0 33 0 0 NA NA
map:
3L 12798910 T 12559896 T
3L 12798911 C 12559897 C
3L 12798912 A 12559898 A
3L 12798913 G 12559899 G
3L 12798914 T 12559900 T
3L 12798915 G 12559901 G
3L 12798916 T 12559902 T
3L 12798917 A 12559903 A
3L 12798918 G 12559904 G
3L 12798919 T 12559905 T
3L 12798920 T 12559906 T
output:
Scf_3L 12798910 T 0 41 0 0 NA NA 3L 12559896 T 0 31 0 0 NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA 3L 12559897 C 0 0 33 0 NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA 3L 12559898 A 34 0 0 0 NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA 3L 12559899 G 0 0 0 33 NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA 3L 12559900 T 0 34 0 0 NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA 3L 12559901 G 0 0 0 33 NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA 3L 12559902 T 0 33 0 0 NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA 3L 12559903 A 33 0 0 0 NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA 3L 12559904 G 0 0 0 33 NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA 3L 12559905 T 0 34 0 0 NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA 3L 12559906 T 0 33 0 0 NA NA
The immediate problem is that the indices of the tied arrays start at zero, while the line numbers in $. start at 1. That means you need to subtract one from $. or from the $reference variables before using them as indices. So your resulting data was never correct in the first place, and you may have overlooked that if it weren't for the warning!
I fixed that and also tidied up your code a little. I mostly added use autodie so that there's no need to check the status of IO operations (except for Tie::File), changed to list assignments, moved the code to read the files into a subroutine, and added code blocks so that the lexical file handles would be closed automatically.
I also used the tied arrays to build the %map hashes instead of opening the files separately, which means their values are already zero-based, as they need to be.
Oh, and I removed looks_like_number, because the $reference variables must be either numeric or undef, since that's all we put into the hash. The correct way to check that a value isn't undef is with the defined operator.
#!/usr/bin/perl
use strict;
use warnings 'all';
use autodie;
use Fcntl 'O_RDONLY';
use Tie::File;
my ( $mapfile, $infile1, $infile2, $outfile ) = @ARGV;
{
tie my @file1, 'Tie::File' => $infile1, mode => O_RDONLY
or die "Could not open $infile1: $!";
tie my @file2, 'Tie::File' => $infile2, mode => O_RDONLY
or die "Could not open $infile2: $!";
my %map1 = map { (split /\t/, $file1[$_], 3)[1] => $_ } 0 .. $#file1;
my %map2 = map { (split /\t/, $file2[$_], 3)[1] => $_ } 0 .. $#file2;
open my $map_fh, '<', $mapfile;
open my $out_fh, '>', $outfile;
while ( <$map_fh> ) {
chomp;
my @row = split /\t/;
my ( $species1, $species2 ) = @row[1,3];
my $reference1 = $map1{$species1};
my $reference2 = $map2{$species2};
my @nomatch = ( "NA", "", "NA", "", "", "", "", "NA", "NA" );
my @fields = (
( defined $reference1 ? $file1[$reference1] : @nomatch),
( defined $reference2 ? $file2[$reference2] : @nomatch),
);
print $out_fh join( "\t", @fields ), "\n";
}
}
output
Scf_3L 12798910 T 0 41 0 0 NA NA NA NA NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA NA NA NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA NA NA NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA NA NA NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA NA NA NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA NA NA NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA NA NA NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA NA NA NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA NA NA NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA NA NA NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA NA NA NA NA
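The two points made in the answer, zero-based indices and testing for "no match" explicitly rather than by truthiness, carry over to other languages. The following Python sketch is only an illustration of those two points under simplified assumptions (files already read into lists of tab-separated strings). The names index_by_coordinate, merge_by_map, and NOMATCH are hypothetical.

```python
NOMATCH = ["NA", "", "NA", "", "", "", "", "NA", "NA"]


def index_by_coordinate(lines):
    """Map field 2 (the coordinate) of each line to its zero-based position."""
    return {line.split("\t")[1]: i for i, line in enumerate(lines)}


def merge_by_map(map_lines, file1, file2):
    """Join lines of file1 and file2 according to the coordinate pairs in map_lines."""
    map1 = index_by_coordinate(file1)
    map2 = index_by_coordinate(file2)
    placeholder = "\t".join(NOMATCH)
    merged = []
    for line in map_lines:
        row = line.rstrip("\n").split("\t")
        i1 = map1.get(row[1])   # None when the coordinate is absent
        i2 = map2.get(row[3])
        # Compare against None explicitly: index 0 is a valid match but is falsy.
        left = file1[i1] if i1 is not None else placeholder
        right = file2[i2] if i2 is not None else placeholder
        merged.append(left + "\t" + right)
    return merged


file1 = [
    "Scf_3L\t12798910\tT\t0\t41\t0\t0\tNA\tNA",
    "Scf_3L\t12798911\tC\t0\t0\t43\t0\tNA\tNA",
]
file2 = ["3L\t12559896\tT\t0\t31\t0\t0\tNA\tNA"]
map_lines = [
    "3L\t12798910\tT\t12559896\tT\n",
    "3L\t12798911\tC\t12559897\tC\n",
]
for merged_line in merge_by_map(map_lines, file1, file2):
    print(merged_line)
```

Using enumerate gives positions that already match list indexing, which sidesteps the off-by-one between line numbers and array indices. The explicit None check plays the same role as defined in the Perl version.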