Perl Regular expression extract - perl

I'm trying to extract a certain string of numbers from a text file using a regular exression, but when my code runs, it is grabbing the numbers after the slash in the separation between date and time. Here is what I have so far.
while ( <INFILE> ) {
my #fields = split( /\ /, $_ );
my #output;
foreach my $field ( #fields ) {
if ( $field =~ /[0-9]{5}\// ) {
push #output, $field;
}
}
if ( #output ) {
my $line = join( ' ', #output );
print "$line\n";
print OUTFILE "$line\n";
}
}
The line I am trying to extract data from is
D2001235 9204 254/2004 254/1944 254/2041 15254/2011 ALL-V4YM 001 AUTO C-C0000
The data I need is the 15254 but when I run my code it returns 15254/2011 and my program is erroring out.

The problem is that you are storing the entire $field in the output array, but you only want the number to the left of the slash to be stored. You could use capturing parentheses in the regular expression and the $1 special variable. This outputs 15254:
use warnings;
use strict;
while (<DATA>) {
my #fields = split( /\ /, $_ );
my #output;
foreach my $field (#fields) {
if ( $field =~ /^([0-9]{5})\// ) {
push #output, $1;
}
}
if (#output) {
my $line = join( ' ', #output );
print "$line\n";
}
}
__DATA__
D2001235 9204 254/2004 254/1944 254/2041 15254/2011 ALL-V4YM 001 AUTO C-C0000

As explained, you are saving an entire field in #output if it matches the regex, instead of just the first part before the slash
Your split is also unnecessarily complicated, and join isn't needed
All you need is this
while ( <INFILE> ) {
my #output = map m{^([0-9]{5})/}, split;
if ( #output ) {
print "#output\n";
print OUTFILE "#output\n";
}
}

Related

Double hash entries while printing

As part of a bigger program I've got a hash. I'm testing the program printing the keys but they are all duplicates, I don't know why
while ( my $line = <SEQ> ) {
chomp $line;
$line =~ s/>//;
my ( #split1 ) = split( "\t", $line );
foreach my $chr ( keys %position ) {
#print Dumper \%position;
print "$chr\n";
foreach my $pos ( sort keys %{ $position{$chr} } ) {
if ( $split1[0] =~ /$chr/ ) {
#print "$chr\t$pos\n";
}
}
}
}
%position is a nested hash, when I print the keys on print "$chr\n"; they are all doubled and I don't understand why.
The file opened on handle SEQ looks like this:
>chr1\tACTGTAGTCTCATCCTAT...
>chr2\tACGTAGCTAGT....
and so on
You have a foreach loop inside a while which will print all of the keys of the %position hash for every line of input
That will cause symptoms like what you're describing. Is that what you're looking for?
Start with this:
my %data;
while(<DATA>){
chomp;
my ($chr, $seq) = split;
$data{$chr} = $seq;
}
print Dumper \%data;

Perl parse CSV file "fill" and "null" fields

Okay - I'm going to post my entire script since I get chastised when I don't do it - even though, last time I did that I got chastised for posting the whole script. I simply need to know if the one line I originally asked about would work. ENTIRE SCRIPT (which was working just fine until the other dept gave me their data entirely differently than what we were originally told it would be) TO FOLLOW AT THE END
I'm parsing through and scrubbing a CSV file to make it ready to be loaded in a MySQL table. It is loaded through the table via someone else's "batch Java program" and if any field is empty the batch file stops with an error.
I've been told to just put in a blank space whenever there's an empty field in any record. Would something as simple as this work?
if ( ! length $fields[2] ) {
$_ = ' ' for $fields[2];
}
And would there be a way to check either various multiple fields at once? Or what might be better would be to check ALL the fields (this is after the record has been split) as the last thing I do just before writing the record back out to the CSV file.
Here's the entire script. Please don't tell me how what I'm doing within the already working script is not how you would do it. -
#!/usr/bin/perl/
use strict;
use warnings;
use Data::Dumper;
use Time::Piece;
my $filename = 'mistints_1505_comma.csv';
#my $filename = 'test.csv';
# Open input file
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
# Open error handling file
open ( my $ERR_FH, '>', "errorFiles1505.csv" ) or die $!;
# Read the header line of the input file and print to screen.
chomp(my $line = <$FH>);
my #fields = split(/,/, $line);
print Dumper(#fields), $/;
my #data;
# Read the lines one by one.
while($line = <$FH>) {
chomp($line);
# Scrub data of characters that cause scripting problems down the line.
$line =~ s/[\'\\]/ /g;
# split the fields of each record
my #fields = split(/,/, $line);
# Check if the storeNbr field is empty. If so, write record to error file.
if (!length $fields[28]) {
chomp (#fields);
my $str = join ',', #fields;
print $ERR_FH "$str\n";
}
else
{
# Concatenate the first three fields and add to the beginning of each record
unshift #fields, join '_', #fields[28..30];
# Format the DATE fields for MySQL
$_ = join '-', (split /\//)[2,0,1] for #fields[10,14,24,26];
# Scrub colons from the data
$line =~ s/:/ /g;
# If Spectro_Model is "UNKNOWN", change
if($fields[22] eq "UNKNOWN"){
$_ = 'UNKNOW' for $fields[22];
}
# If tran_date is blank, insert 0000-00-00
if(!length $fields[10]){
$_ = '0000-00-00' for $fields[10];
}
# If init_tran_date is blank, insert 0000-00-00
if(!length $fields[14]){
$_ = '0000-00-00' for $fields[14];
}
# If update_tran_date is blank, insert 0000-00-00
if(!length $fields[24]){
$_ = '0000-00-00' for $fields[24];
}
# If cancel_date is blank, insert 0000-00-00
if(!length $fields[26]){
$_ = '0000-00-00' for $fields[26];
}
# Format the PROD_NBR field by deleting any leading zeros before decimals.
$fields[12] =~ s/^\s*0\././;
# put the records back
push #data, \#fields;
}
}
close $FH;
close $ERR_FH;
print "Unsorted:\n", Dumper(#data); #, $/;
#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
#data = sort {
$a->[0] cmp $b->[0] ||
$a->[14] cmp $b->[14] ||
$a->[26] cmp $b->[26] ||
$a->[27] cmp $b-> [27]
} #data;
#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1505.csv';
print $OFH join(',', #$_), $/ for #data;
close $OFH;
exit;
As far as I can tell you have split a record on commas ,, and you want to alter all fields that are empty strings to contain a single space
I would write this
use strict;
use warnings 'all';
my $record = 'a,b,c,,e,,g,,i,,k,,m,n,o,p,q,r,s,t';
my #fields = map { $_ eq "" ? ' ' : $_ } split /,/, $record;
use Data::Dump;
dd \#fields;
output
[ "a", "b", "c", " ", "e", " ", "g", " ", "i", " ", "k", " ", "m" .. "t" ]
Alternatively, if you have some fields that need to be set to something different if they are empty, you can set up an array of defaults
That would look like this. All of the #defaults array is set to spaces except for fields 10, 11 and 12, which are 0000-00-00. These are picked up after the record is split
use strict;
use warnings 'all';
my #defaults = (' ') x 20;
$defaults[$_] = '0000-00-00' for 9, 10, 11;
my $record = 'a,b,c,,e,,g,,i,,k,,m,n,o,p,q,r,s,t';
my #fields = split /,/, $record;
for my $i ( 0 .. $#fields ) {
$fields[$i] = $defaults[$i] if $fields[$i] eq '';
}
use Data::Dump;
dd \#fields;
output
[ "a", "b", "c", " ", "e", " ", "g", " ", "i", "0000-00-00", "k", "0000-00-00", "m" .. "t" ]
Having seen your full program, I recommend something like this. If you had shown a sample of your input data then I could have used a hash to refer to column names instead of numbers, making it much more readable
#!/usr/bin/perl/
use strict;
use warnings 'all';
use Data::Dumper;
use Time::Piece;
my $filename = 'mistints_1505_comma.csv';
#my $filename = 'test.csv';
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
open( my $ERR_FH, '>', "errorFiles1505.csv" ) or die $!;
chomp( my $line = <$FH> );
my #fields = split /,/, $line; #/
print Dumper( \#fields ), "\n";
my #data;
# Read the lines one by one.
while ( <$FH> ) {
chomp;
# Scrub data of characters that cause scripting problems down the line.
tr/'\\/ /; #'
my #fields = split /,/; #/
# Check if the storeNbr field is empty. If so, write record to error file.
if ( $fields[28] eq "" ) {
my $str = join ',', #fields;
print $ERR_FH "$str\n";
next;
}
# Concatenate the first three fields and add to the beginning of each record
unshift #fields, join '_', #fields[ 28 .. 30 ];
# Format the DATE fields for MySQL
$_ = join '-', ( split /\// )[ 2, 0, 1 ] for #fields[ 10, 14, 24, 26 ];
# Scrub colons from the data
tr/://d; #/
my $i = 0;
for ( #fields ) {
# If "Spectro_Model" is "UNKNOWN" then change to "UNKNOW"
if ( $i == 22 ) {
$_ = 'UNKNOW' if $_ eq 'UNKNOWN';
}
# If a date field is blank then insert 0000-00-00
elsif ( grep { $i == $_ } 10, 14, 24, 26 ) {
$_ = '0000-00-00' if $_ eq "";
}
# Format the PROD_NBR field by deleting any leading zeros before decimals.
elsif ( $i == 12 ) {
s/^\s*0\././;
}
# Change all remaining empty fields to a single space
else {
$_ = ' ' if $_ eq "";
}
++$i;
}
push #data, \#fields;
}
close $FH;
close $ERR_FH;
print "Unsorted:\n", Dumper(#data); #, $/;
#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
#data = sort {
$a->[0] cmp $b->[0] or
$a->[14] cmp $b->[14] or
$a->[26] cmp $b->[26] or
$a->[27] cmp $b->[27]
} #data;
#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1505.csv' or die $!;
print $OFH join(',', #$_), $/ for #data;
close $OFH;
Well if you did it before splitting into $fields, you ought to be able to do something like
# assuming a CSV line is in $_
#pad null at start of line
s/^,/ ,/;
#pad nulls in the middle
s/,,/, ,/g;
#pad null at the end
s/,$/, /;
Don't try to roll out your own CSV parsing code. Use Text::CSV or Text::CSV::Slurp.
With Text::CSV you could do something like
$line = $csv->string(); # get the combined string
$status = $csv->parse($line); # parse a CSV string into fields
#columns = map {defined $_ ? $_ : " "} $csv->fields(); # get the parsed fields
Are you really sure you want to replace nulls with spaces? I'd say if the field is undefined it should be NULL in db.

Ignore the first two lines with ## in perl

all.
Im a newbie in programming especially in perl. I would like to skip the first two lines in my dataset.
these are my codes.
while (<PEPTIDELIST>) {
next if $_ !=~ "##";
chomp $_;
#data = split /\t/;
chomp $_;
next if /Sequence/;
chomp $_;
$npeptides++;
# print "debug: 0: $data[0] 1: $data[1] 2: $data[2] 3:
$data[3]
\n" if ( $debug );
my $pepseq = $data[1];
#print $pepseq."\n";
foreach my $header (keys %sequence) {
#print "looking for $pepseq in $header \n";
if ($sequence{$header} =~ /$pepseq/ ) {
print "matched $pepseq in protein $header" if ( $debug );
# my $in =<STDIN>;
if ( $header =~ /(ENSGALP\S+)\s.+(ENSGALG\S+)/ ) {
print "debug: $1 $2 have the pep = $pepseq \n\n" if (
$debug);
my $lprot = $1;
my $lgene = $2;
$gccount{$lgene}++;
$pccount{$lprot}++;
# print "$1" if($debug);
# print "$2" if ($debug);
print OUT "$pepseq,$1,$2\n";
}
}
}
my $ngenes = keys %gccount;
my $nprots = keys %pccount;
somehow the peptide is not in the output list. please help point me where it goes wrong?
thanks
If you want to skip lines that contain ## anywhere in them:
next if /##/;
If you only want to skip lines that start with ##:
next if /^##/;
If you always want to skip the first two lines, regardless of content:
next if $. < 3;
next if $_ !=~ "##"; must be next if $_ =~ "##";
Ignore this lie if $_ matched ##

Perl - sort filenames by order based on the filemask YYY-MM-DD that is in the filename

Need some help, not grasping a solution here on what method I should use.
I need to scan a directory and obtain the filenames by order of
1.YYYY-MM-DD, YYYY-MM-DD is part of the filename.
2. Machinename which is at the start of the filename to the left of the first "."
For example
Machine1.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine2.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine3.output.log.2014-02-26
So that it outputs in an array as follows
Machine1.output.log.2014-02-26
Machine2.output.log.2014-02-26
Machine3.output.log.2014-02-26
Machine1.output.log.2014-02-27
Machine2.output.log.2014-02-27
Thanks,
Often, temporarily turning your strings into a hash or array for sorting purposes, and then turning them back into the original strings is the most maintainable way.
my #filenames = qw/
Machine1.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine2.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine3.output.log.2014-02-26
/;
#filenames =
map $_->{'orig_string'},
sort {
$a->{'date'} cmp $b->{'date'} || $a->{'machine_name'} cmp $b->{'machine_name'}
}
map {
my %attributes;
#attributes{ qw/orig_string machine_name date/ } = /\A(([^.]+)\..*\.([^.]+))\z/;
%attributes ? \%attributes : ()
} #filenames;
You can define your own sort like so ...
my #files = (
"Abc1.xxx.log.2014-02-26"
, "Abc1.xxx.log.2014-02-27"
, "Abc2.xxx.log.2014-02-26"
, "Abc2.xxx.log.2014-02-27"
, "Abc3.xxx.log.2014-02-26"
);
foreach my $i ( #files ) { print "$i\n"; }
sub bydate {
(split /\./, $a)[3] cmp (split /\./, $b)[3];
}
print "sort it\n";
foreach my $i ( sort bydate #files ) { print "$i\n"; }
You can take your pattern 'YYYY-MM-DD' and match it to what you need.
#!/usr/bin/perl
use strict;
opendir (DIRFILES, ".") || die "can not open data file \n";
my #maplist = readdir(DIRFILES);
closedir(MAPS);
my %somehash;
foreach my $tmp (#maplist) {
next if $tmp =~ /^.{1,2}$/;
next if $tmp =~ /test/;
$tmp =~ /(\d{4})-(\d{2})-(\d{2})/;
$somehash{$tmp} = $1 . $2 . $3; # keep the original file name
# allows for duplicate dates
}
foreach my $tmp (keys %somehash) {
print "-->", $tmp, " --> " , $somehash{$tmp},"\n";
}
my #list= sort { $somehash{$a} <=> $somehash{$b} } keys(%somehash);
foreach my $tmp (#list) {
print $tmp, "\n";
}
Works, tested it with touch files.

perl "eq" doesn't work well. I can't found my fault

I'm writting a small server with perl. There is some small problem. When the client give me a sentence like this "op:xxx:xxx:xxx", I'll get op. then do things depending on what op is. It works will if the op is adduser and so on. (I use if $op eq "adduser"...)
But when I get a "getList:xxx:xxx" and I have get the $op = getList, it can't pass it like "if $op eq "getList"". I know, it must be my mistake. But I just can't find it.
thank you for everyone.
use warnings;
use strict;
package MyPackage;
use base qw(Net::Server);
our %data_base;
our %tag_base;
sub list {
my %resault;
foreach ( keys %tag_base) {
print STDERR $_ . "1";
my #tags = split /:/, $tag_base{$_};
foreach ( #tags) {
$resault{$_} ++;
}
}
my #tags;
foreach ( keys %resault) {
push #tags, "$_,$resault{$_}";
}
$_ = join ";", #tags;
print ;
print STDERR ;
}
sub users {
my $topic = shift;
my #users;
foreach ( keys %tag_base) {
push #users, $_ if $tag_base{$_} =~ /$topic/;
}
$_ = join ";", #users;
print ;
}
sub process_request {
my $self = shift;
my $person;
my #info;
while (<STDIN>) {
my #gets = split /:/, $_;
print STDERR "#gets\n";
# $data_base{shift #person} = join ":", #person;
my $op = shift #gets;
$op =~ s/\s//;
print STDERR $op . "\n";
if ( $op eq "adduser") {
my $user_name = shift #gets;
if ( exists $data_base{$user_name}) {
print "already_exist";
} else {
$data_base{$user_name} = join ":", #gets;
print "addUserSu";
}
} elsif ( $op eq "login") {
my $login_name = shift #gets;
my $login_pw = shift #gets;
if ( defined $data_base{$login_name}) {
$person = $data_base{$login_name};
#info = split /:/, $person;
$info[0] =~ s/\s+//;
if ($login_pw eq $info[0]) {
print "$person";
} else {
print "/$info[0]/";
}
} else {
print "unexist_user";
}
} elsif ( $op eq "addTag") {
my $tag_user = shift #gets;
$tag_base{$tag_user} = join ":", #gets;
print "addTagSu";
} elsif ( $op eq "getList") {
print STDERR "right";
&list;
} elsif ( $op eq "getUsers") {
&users;
}
}
}
MyPackage->run(port => 13800);
I can see two (simple) reasons this might fail:
$op =~ s/\s//;
You only remove one whitespace: The first one. If your intention is to strip all whitespace, you'd want s/\s+//g.
And second:
Random capital letters in strings, variable names and commands is Evil. eq is case sensitive, so if $op is "getlist", then if ($op eq "getList") will be false. Unless capitalization is important to you, you could do if (lc($op) eq "getlist").
Without sample input, expected output and actual output, this is however nothing more than guesswork.
Also, as a debug statement, this is useless:
print STDERR $op . "\n";
That is easily confused and overlooked. For example, if $op is empty, it just produces a blank line in your error log. Use:
print STDERR "OP is: '$op'\n";
Now you will be able to identify the line where $op should appear, and you will be more easily see whitespace surrounding it.
You are reading strings without chomping them.
i.e.
When you run your code :
addtag:fred:barney
The input is stored as fred => "barney\n"
when you getList, the output is :
barney
,1;
I suspect the client is expecting 1 line of output that reads :
barney,1;
So, just add a chomp in your code here :
while (<STDIN>) {
chomp;
my #gets = split /:/, $_;