Reformat Dates in Perl (for later use in MySQL) - perl

I am writing a Perl script to open a .csv file, make some changes, sort it on four fields, then write it back to a new file. I found out that, because this data will then be used to load a MySQL table, I also need to reformat the date fields. Currently, dates are in the file as 00/00/0000, and for MySQL they need to be formatted as 0000-00-00. For now I have tried to do it for just one field, although I actually need to do it for three date fields on each line of the .csv file.
The script runs, but it does not reformat the date field I'm testing this on.
#!/usr/bin/perl/
use strict;
use warnings;
use Data::Dumper;
#my $filename = '/swpkg/shared/batch_processing/mistints/mistints.csv';
my $filename = 'tested.csv';
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;
my @data;
# Read the lines one by one.
while($line = <$FH>) {
# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
chomp($line);
my @fields = split(/,/, $line);
unshift @fields, join '_', @fields[0..2];
push @data, \@fields;
my $in_date = $fields[14];
my $db_date = join '-', reverse split /\D/, $in_date;
}
close $FH;
print "Unsorted:\n", Dumper(@data); #, $/;
@data = sort {
$a->[0] cmp $b->[0] ||
$a->[20] cmp $b->[20] ||
$a->[23] cmp $b->[23] ||
$a->[26] cmp $b-> [26]
} @data;
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
#print $OFH Dumper(@data);
print $OFH join(',', @$_), $/ for @data;
close $OFH;
#print "Sorted:\n", Dumper(@data);
#print "Sorted:", Dumper(@data);
exit;
The two lines I added to this script (which are not working) are the my $in_date and my $db_date lines. Now I will also need to reformat two fields (at the end of each line) that are DATETIME, i.e. 10/23/2015 10:47, where I will only need to reformat the date within that field, and I'm not even sure where to begin tackling that one.
And please go easy since I'm a noob with Perl.
EDIT - SORRY, had to re-edit because I didn't notice the first part of my script had not copied.

Rather than using a bunch of string functions, it's better to use the Time::Piece module to parse and reformat date-time values. It has strptime and strftime methods to do this for you. This short program shows the reformatting of both date-time formats that you mention. ymd is a convenience method, and is equivalent to strftime('%Y-%m-%d')
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = Time::Piece->strptime($in_date, '%m/%d/%Y')->ymd;
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = Time::Piece->strptime($in_date, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
output
01/02/2003 -> 2003-01-02
01/02/2003 04:05 -> 2003-01-02 04:05
Update
If you prefer, you could write a subroutine that takes the original date and its format string, together with the desired format. Like this
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = date_from_to($in_date, '%m/%d/%Y', '%Y-%m-%d');
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = date_from_to($in_date, '%m/%d/%Y %H:%M', '%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
sub date_from_to {
my ($date, $from, $to) = @_;
Time::Piece->strptime($date, $from)->strftime($to);
}
The output is identical to that of the program above
Update
Regarding your comment, your code should look like this
$_ = join '-', (split /\//)[2,0,1] for @fields[14, 20, 23];
$_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push @data, \@fields;
But I would prefer to see some consistency in the way the date fields are handled, like this
$_ = Time::Piece->strptime($_, '%m/%d/%Y')->strftime('%Y-%m-%d') for @fields[14,20,23];
$_ = Time::Piece->strptime($_, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push @data, \@fields;
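One thing the snippets above do not cover is what happens when a date field is empty or malformed: strptime dies on input it cannot parse. The following is only a sketch of one way to guard against that. The helper name reformat_date and the sample field positions are made up for illustration, and passing the original value through unchanged is an assumption about what you would want.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: reformat a date, but hand back the original
# value when it is empty or cannot be parsed with the given format.
sub reformat_date {
    my ($value, $from, $to) = @_;
    return $value unless defined $value && $value =~ /\d/;
    my $t = eval { Time::Piece->strptime($value, $from) };
    return defined $t ? $t->strftime($to) : $value;
}

# Made-up row: two date fields (one empty) and two date-time fields.
my @fields = ('a', '10/23/2015', '', '10/23/2015 10:47', '11/01/2015 09:05');

$_ = reformat_date($_, '%m/%d/%Y',       '%Y-%m-%d')       for @fields[1, 2];
$_ = reformat_date($_, '%m/%d/%Y %H:%M', '%Y-%m-%d %H:%M') for @fields[3, 4];

print join(',', @fields), "\n";
```

Because the loop variable aliases the array elements, the fields are changed in place, so the reformatting has to happen before the row is written out.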

Related

Strip out comma from position variable

I have a file:
434462PW1 5
76252PPP8 5,714.79
76252PMB2 16,950.17
76252PRC5 25,079.70
76252PNY1 30,324.50
62630WCQ8 1.09
62630WCZ8 1.09
62630WBX4 36,731.90
62630WCQ8 1.07
62630WCZ8 1.07
76252PGB9 1.07
62630WBN6 1.07
62630WBA4 1.07
I need the commas stripped out of the second value, and a comma added between the 1st and 2nd values.
434462PW1,5
76252PPP8,5714.79
76252PMB2,16950.17
76252PRC5,25079.70
76252PNY1,30324.50
62630WCQ8,1.09
62630WCZ8,1.09
62630WBX4,36731.90
62630WCQ8,1.07
62630WCZ8,1.07
76252PGB9,1.07
62630WBN6,1.07
62630WBA4,1.07
Here is the code. I'm having trouble stripping just the number values.
#!/usr/bin/perl
use strict ;
use warnings;
open my $handle, '<', "foofile";
chomp(my @positionArray = <$handle>);
foreach my $pos(@positionArray) {
if ($pos =~ /(\w{9})\s+(.*)/) {
if ($2=~/,/) {
my $without = $2=~s/,//g ;
print "$1,$without\n";
}
}
}
Since commas only appear in the 2nd column, you can simply delete all commas from each line. Also, since whitespace only exists between your 2 columns, you can then replace all space with a comma.
foreach my $pos (@positionArray) {
$pos =~ s/,//g;
$pos =~ s/\s+/,/;
print "$pos\n";
}
Another way to sort this out is to use the map function (both the input and the output are array variables).
chomp(my @positionArray = <$handle>);
my @out = map { $_=~s/\,//g; $_=~s/\s+/,/; $_; } @positionArray;
use Data::Dumper;
print Dumper \@out;
Your code is more complicated than it needs to be
use strict ;
use warnings;
use feature 'say';
my $filename = 'foofile';
open my $fh, '<', $filename
or die "Couldn't open $filename $!";
my @lines = <$fh>;
close $fh;
chomp @lines; # snip eol
for (@lines) {
my($id,$val) = split;
$val =~ s/,//g; # 'g' removes every comma, so values of a million or more also work
say "$id,$val";
}
Output
434462PW1,5
76252PPP8,5714.79
76252PMB2,16950.17
76252PRC5,25079.70
76252PNY1,30324.50
62630WCQ8,1.09
62630WCZ8,1.09
62630WBX4,36731.90
62630WCQ8,1.07
62630WCZ8,1.07
76252PGB9,1.07
62630WBN6,1.07
62630WBA4,1.07
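The answers above rely on the data having exactly two whitespace-separated columns. As a rough sketch of the same idea packaged as a function, the helper below splits on the first run of whitespace only and then deletes the commas from the value. The name to_csv_line and the third sample row are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "ID   1,234.56" into "ID,1234.56".
sub to_csv_line {
    my ($line) = @_;
    # Limit of 2: split only at the first run of whitespace.
    my ($id, $val) = split ' ', $line, 2;
    return $line unless defined $val;   # no second column: leave as is
    $val =~ tr/,//d;                    # delete every comma
    $val =~ s/\s+\z//;                  # drop trailing whitespace or newline
    return "$id,$val";
}

print to_csv_line($_), "\n"
    for '434462PW1 5', '76252PMB2 16,950.17', '99999XXX9 1,234,567.89';
```

tr/,//d is used instead of s/,//g simply because it is the usual idiom for deleting a single character; either works here.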

Split my output into multiple files

I have the following list in a CSV file, and my goal is to split this list into directories named YYYY-Month based on the date in each row.
NAME99;2018/06/13;12:27:30
NAME01;2018/06/13;13:03:59
NAME00;2018/06/15;11:33:01
NAME98;2018/06/15;12:22:00
NAME34;2018/06/15;16:58:45
NAME17;2018/06/18;15:51:10
NAME72;2018/06/19;10:06:37
NAME70;2018/06/19;12:44:03
NAME77;2018/06/19;16:36:55
NAME25;2018/06/11;16:32:57
NAME24;2018/06/11;16:32:57
NAME23;2018/06/11;16:37:15
NAME01;2018/06/11;16:37:15
NAME02;2018/06/11;16:37:15
NAME01;2018/06/11;16:37:18
NAME02;2018/06/05;09:51:17
NAME00;2018/06/13;15:04:29
NAME07;2018/06/19;10:02:26
NAME08;2018/06/26;16:03:57
NAME09;2018/06/26;16:03:57
NAME02;2018/06/27;16:58:12
NAME03;2018/07/03;07:47:21
NAME21;2018/07/03;10:53:00
NAMEXX;2018/07/05;03:13:01
NAME21;2018/07/05;15:39:00
NAME01;2018/07/05;16:00:14
NAME00;2018/07/08;11:50:10
NAME07;2018/07/09;14:46:00
What is the smartest method to achieve this result without having to create a list of static routes, in which to carry out the append?
Currently my program writes this list to a directory called YYYY-Month only on the basis of localtime but does not do anything on each line.
Perl
#!/usr/bin/perl
use strict;
use warnings 'all';
use feature qw(say);
use File::Path qw<mkpath>;
use File::Spec;
use File::Copy;
use POSIX qw<strftime>;
my $OUTPUT_FILE = 'output.csv';
my $OUTFILE = 'splitted_output.csv';
# Output to file
open( GL_INPUT, $OUTPUT_FILE ) or die $!;
$/ = "\n\n"; # input record separator
while ( <GL_INPUT> ) {
chomp;
my @lines = split /\n/;
my $i = 0;
foreach my $lines ( @lines ) {
# Encapsulate Date/Time
my ( $name, $y, $m, $d, $time ) =
$lines[$i] =~ /\A(\w+);(\d+)\/(\d+)\/(\d+);(\d+:\d+:\d+)/;
# Generate Directory YYYY-Month - #2009-January
my $dir = File::Spec->catfile( $BASE_LOG_DIRECTORY, "$y-$m" ) ;
unless ( -e $dir ) {
mkpath $dir;
}
my $log_file_path = File::Spec->catfile( $dir, $OUTFILE );
open( OUTPUT, '>>', $log_file_path ) or die $!;
# Here I append value into files
print OUTPUT join ';', "$y/$m/$d", $time, "$name\n";
$i++;
}
}
close( GL_INPUT );
close( OUTPUT );
There is no reason to care about the actual date, or to use date functions at all here. You want to split up your data based on a partial value of one of the columns in the data. That just happens to be the date.
NAME08;2018/06/26;16:03:57 # This goes to 2018-06/
NAME09;2018/06/26;16:03:57 #
NAME02;2018/06/27;16:58:12 #
NAME03;2018/07/03;07:47:21 # This goes to 2018-07/
NAME21;2018/07/03;10:53:00 #
NAMEXX;2018/07/05;03:13:01 #
NAME21;2018/07/05;15:39:00 #
The easiest way to do this is to iterate your input data, then stick it into a hash with keys for each year-month combination. But you're talking about log files, and they might be large, so that's inefficient.
We should work with different file handles instead.
use strict;
use warnings;
my %months = ( 6 => 'June', 7 => 'July' );
my %handles;
while (my $row = <DATA>) {
# no chomp, we don't actually care about reading the whole row
my (undef, $dir) = split /;/, $row; # discard name and everything after date
# create the YYYY-Month key: drop the day, and use '+ 0' so that "06" finds the key 6
$dir =~ s[^(....)/(..).*][$1-$months{$2 + 0}];
# open a new handle for this year/month if we don't have it yet
unless (exists $handles{$dir}) {
# create the directory (skipped here) ...
open my $fh, '>', "$dir/filename.csv" or die $!;
$handles{$dir} = $fh;
}
# write out the line to the correct directory
print { $handles{$dir} } $row;
}
__DATA__
NAME08;2018/06/26;16:03:57
NAME09;2018/06/26;16:03:57
NAME02;2018/06/27;16:58:12
NAME03;2018/07/03;07:47:21
NAME21;2018/07/03;10:53:00
NAMEXX;2018/07/05;03:13:01
NAME21;2018/07/05;15:39:00
I've skipped the part about creating the directory as you already know how to do this.
This code will also work if your rows of data are not sequential. It's not the most efficient, as the number of handles grows with the number of distinct months in your data, but as long as you don't have hundreds of them open at the same time that does not really matter.
Things of note:
You don't need chomp because you don't care about working with the last field.
You don't need to assign all of the values after split because you don't care about them.
You can discard values by assigning them to undef.
Always use three-argument open and lexical file handles.
The {} in print { ... } $row are needed to tell Perl that this is the handle we are printing to. See http://perldoc.perl.org/functions/print.html.
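For completeness, here is a sketch of the same handle-per-month idea with the directory creation filled in. It is not the only way to do it. The temporary base directory, the month-name array, and the three sample rows are assumptions made so that the example is self-contained; the output file name is taken from the question.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Assumption: a temporary directory stands in for the real base directory.
my $base = tempdir(CLEANUP => 1);

my @month_name = qw(January February March April May June
                    July August September October November December);
my %handles;

# Sample rows, deliberately not in month order.
my @rows = (
    "NAME09;2018/06/26;16:03:57\n",
    "NAME03;2018/07/03;07:47:21\n",
    "NAME02;2018/06/27;16:58:12\n",
);

for my $row (@rows) {
    # Pull year and month out of the date column; skip rows without one.
    my ($y, $m) = $row =~ m{;(\d{4})/(\d{2})/} or next;
    my $dir = File::Spec->catdir($base, "$y-$month_name[$m - 1]");

    # Open each output file once and keep the handle for later rows.
    unless ($handles{$dir}) {
        make_path($dir);
        open my $fh, '>>', File::Spec->catfile($dir, 'splitted_output.csv')
            or die "Cannot open file in $dir: $!";
        $handles{$dir} = $fh;
    }
    print { $handles{$dir} } $row;
}

close $_ for values %handles;
```

Indexing an array with $m - 1 sidesteps the "06" versus 6 hash-key mismatch, because the subtraction converts the string to a number.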

Using Parse::CSV to limit splits

I am trying to use Parse::CSV to parse a simple CSV file with a header and 2 columns. The second column may contain commas, but I want to ignore them. Is there any way to limit how many times it splits on commas? Here is what I have so far:
#!/usr/bin/perl
use Parse::CSV;
my $csv = Parse::CSV->new(file => 'file.csv');
while (my $row = $csv->fetch) {
print $row->[0] . "\t" . $row->[1] . "\n";
}
Here is an example of what my data looks like:
1234,text1,text2
5678,text3
90,text4,text5
This would return
1234 text1,text2
5678 text3
90 text4,text5
If you're really wed to Parse::CSV, you can do this using a filter:
use strict;
use warnings;
use 5.010;
use Parse::CSV;
my $parser = Parse::CSV->new(
file => 'input.csv',
filter => sub { return [ shift @$_, join(',', @$_) ] }
);
while ( my $row = $parser->fetch ) {
say join("\t", @$row);
}
die $parser->errstr if $parser->errstr;
Output:
1234 text1,text2
5678 text3
90 text4,text5
Note that performance will be poor because Parse::CSV is splitting the columns for you, but then you immediately join them back together again.
However, since it appears that you're not working with a true CSV (columns containing the delimiter aren't quoted or escaped in any way), why not just use split with a third argument to specify the maximum number of fields?
use strict;
use warnings;
use 5.010;
open my $fh, '<', 'input.csv' or die $!;
while (<$fh>) {
chomp;
my @fields = split(',', $_, 2);
say join("\t", @fields);
}
close $fh;

How to randomly pair items in a list

I have a list of Accession numbers that I want to pair randomly using a Perl script below:
#!/usr/bin/perl -w
use List::Util qw(shuffle);
my $file = 'randomseq_acc.txt';
my @identifiers = map { (split /\n/)[1] } <$file>;
chomp @identifiers;
#Shuffle them and put in a hash
@identifiers = shuffle @identifiers;
my %pairs = (@identifiers);
#print the pairs
for (keys %pairs) {
print "$_ and $pairs{$_} are partners\n";
but keep getting errors.
The accession numbers in the file randomseq_acc.txt are:
1094711
1586007
2XFX_C
Q27031.2
P22497.2
Q9TVU5.1
Q4N4N8.1
P28547.2
P15711.1
AAC46910.1
AAA98602.1
AAA98601.1
AAA98600.1
EAN33235.2
EAN34465.1
EAN34464.1
EAN34463.1
EAN34462.1
EAN34461.1
EAN34460.1
I needed to add the closing right curly brace to be able to compile the script.
As arrays are indexed from 0, (split /\n/)[1] returns the second field, i.e. what follows newline on each line (i.e. nothing). Change it to [0] to make it work:
my @identifiers = map { (split /\n/)[0] } <$file>; # Still wrong.
The diamond operator needs a file handle, not a file name. Use open to associate the two:
open my $FH, '<', $file or die $!;
my @identifiers = map { (split /\n/)[0] } <$FH>;
Using split to remove a newline is not common. I'd probably use something else:
map { /(.*)/ } <$FH>
# or
map { chomp; $_ } <$FH>
# or, thanks to ikegami
chomp(my @identifiers = <$FH>);
So, the final result would be something like the following:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw(shuffle);
my $filename = '...';
open my $FH, '<', $filename or die $!;
chomp(my @identifiers = <$FH>);
my %pairs = shuffle(@identifiers);
print "$_ and $pairs{$_} are partners\n" for keys %pairs;
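Building a hash from the shuffled list works for the 20 identifiers shown, but it assumes an even number of unique items: an odd count leaves one key without a value, and a repeated identifier would overwrite an earlier pair. A possible alternative, sketched here with a shortened, deliberately odd-sized list, is to take two items at a time with splice and keep the pairs in an array.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shortened sample list with an odd number of items on purpose.
my @identifiers = qw(1094711 1586007 2XFX_C Q27031.2 P22497.2);

my @shuffled = shuffle @identifiers;
my @pairs;

# Remove two items at a time from the front until fewer than two remain.
push @pairs, [ splice @shuffled, 0, 2 ] while @shuffled >= 2;

print "$_->[0] and $_->[1] are partners\n" for @pairs;
print "$shuffled[0] has no partner\n" if @shuffled;
```

The pairing is random, so the printed lines differ from run to run.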

Selecting records from a file based on keys from a second file

My first file looks like:
CHR id position
1 rs58108140 10583
1 rs189107123 10611
1 rs180734498 13302
1 rs144762171 13327
1 chr1:13957:D 13957
And my second file looks like:
CHR SNP POS RiskAl OTHER_ALLELE RAF logOR Pval
10 rs1999138 110140096 T C 0.449034245446375 0.0924443 1.09e-06
6 rs7741604 20839503 C A 0.138318264238111 0.127947 1.1e-06
8 rs1486006 82553172 G C 0.833130882716561 0.147456 1.12727730194884e-06
My script reads in the first file and stores it in an array, and then I would like to find rsIDs from column 2 of the first file that are in column 2 in the second file. I think I am having a problem with how I'm matching the expressions. Here is my script:
#! perl -w
use strict;
use warnings;
my $F = shift @ARGV;
my @snps;
open IN, "$F";
while (<IN>) {
next if m/CHR/;
my @L = split;
push @snps, [$L[0], $L[1], $L[2]] if $L[0] !~ m/[XY]/;
}
close IN;
open IN, "DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt";
while (<IN>) {
my @L = split;
next if m/CHR/;
foreach (@snps) {
next if ($L[0] != ${$_}[0]);
# if not on same chromosome
if ($L[0] = ${$_}[0]) {
# if on same chromosome
if ($L[1] =~ ${$_}[1]) {
print "$L[0] $L[1] ${$_}[2]\n";
last;
}
}
}
}
Your code doesn't seem to correspond to your description. You are comparing both the first and second columns of the file rather than just the second.
The main problems are:
You use $L[0] = ${$_}[0] to compare the first columns. This does an assignment instead of a comparison. You should use $L[0] == ${$_}[0] instead or, better, $L[0] == $_->[0]
You use $L[1] =~ ${$_}[1] to compare the second columns. This will check whether ${$_}[1] is a substring of $L[1]. You could use anchors like $L[1] =~ /^${$_}[1]$/ but it's much better to just do a string comparison as $L[1] eq $_->[1]
The easiest way is to read the second file first so as to build a list of values that you want included from the first file. I have written it so that it does what your code looks like it's supposed to do, i.e. match the first two columns.
That would look like this
use strict;
use warnings;
use autodie;
my ($f1, $f2) = @ARGV;
my %include;
open my $fh2, '<', $f2;
while (<$fh2>) {
my @fields = split;
my $key = join '|', @fields[0,1];
++$include{$key};
}
close $fh2;
open my $fh1, '<', $f1;
while (<$fh1>) {
my @fields = split;
my $key = join '|', @fields[0,1];
print "@fields[0,1,2]\n" if $include{$key};
}
close $fh1;
output
Unfortunately your choice of sample data doesn't include any records in the first file that have matching keys in the second, so there is no output!
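To see the hash lookup produce a match, here is a small in-memory sketch of the same technique. The rows are made up in the layout of the two files (with shortened numeric columns) and chosen so that exactly one chromosome and ID pair appears in both; they are not real results.

```perl
use strict;
use warnings;

# Made-up rows in the layout of the second file (CHR SNP POS ...).
my @second = (
    "10 rs1999138 110140096 T C 0.44 0.09 1.09e-06",
    "6 rs7741604 20839503 C A 0.13 0.12 1.1e-06",
);

# Made-up rows in the layout of the first file (CHR id position).
my @first = (
    "1 rs58108140 10583",
    "10 rs1999138 110140096",
    "6 rs0000000 12345",
);

# Index the second file by "CHR|SNP".
my %include;
for (@second) {
    my @f = split;
    $include{ join '|', @f[0, 1] } = 1;
}

# Keep the rows of the first file whose key was seen in the second.
my @matches;
for (@first) {
    my @f = split;
    push @matches, "@f[0, 1, 2]" if $include{ join '|', @f[0, 1] };
}

print "$_\n" for @matches;
```

Each row of the first file costs a single hash lookup, which is why this scales better than the nested loops in the question.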
Update
This is a corrected version of your own program. It should work, but it is far more efficient and concise to use hashes, as above
use strict;
use warnings;
use autodie;
my ($filename) = @ARGV;
my @snps;
open my $in_fh, '<', $filename;
<$in_fh>; # Discard header line
while (<$in_fh>) {
my @fields = split;
push @snps, \@fields unless $fields[0] =~ /[XY]/;
}
close $in_fh;
open $in_fh, '<', 'DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt';
<$in_fh>; # Discard header line
while (<$in_fh>) {
my @fields = split;
for my $snp (@snps) {
next unless $fields[0] == $snp->[0] and $fields[1] eq $snp->[1];
print "$fields[0] $fields[1] $snp->[2]\n";
last;
}
}
close $in_fh;