Select rows based on text pattern - perl

I want to extract rows from a file that match a particular pattern and I want to do this for over 500 files. It should have the ability to retain the unique name of the file as well.
I used awk but then i have to do each file individually.
c:\>gawk "/S1901/" Census_Tract_*.csv > Census_Tract_*.csv
In the example shown in the link here (http://bit.ly/nMX8qh) I want to retain only those records that have S1901 in them. Apologies for the external link but i am not able to retain formatting of the table.
I found some perl code that I used to write it but it retains all the rows and does not select only those rows/records where the pattern matches. Any tips would be much appreciated. The perl code is below:
#perl -w
$pattern = "Subject_Census*.csv"; # process only those files that match pattern
while (defined ($in = glob($pattern))) {
($out = $in) =~ s/\.csv$/.outcsv/; # read from "xyz.in" and write to "xyz.out"
open (IN, "<", $in) or die "Can't open $in for reading: $!";
open (OUT,">>", $out) or die "Can't open $out for writing: $!";
while (<IN>) {
$mystring =~ /S1901/;
print OUT $_ if $mystring == 0;
}
close (IN) or die "Can't close $in: $!"; # good idea to do some housekeeping
close (OUT) or die "Can't close $out: $!";
}

Untested:
use strict;
use warnings;
use autodie;
my $files_list_filename = 'files.txt';
open my $fl, '<', $files_list_filename;
my #list_of_files = <$fl>;
chomp #list_of_files;
close $fl;
foreach my $file ( #list_of_files ) {
open my $test_fh, '<', $file;
while ( my $line = <$test_fh> ) {
if( $line =~ m/S1901/ ) {
print "$file at $.: $line";
}
}
close $test_fh;
}
Is that sort of what you had in mind? It opens a file named filelist.txt and reads in a list of however many filenames you want to give it. Then it iterates over that list, opening each file one by one, scanning each file one by one, and if a line is found containing the trigger text, it prints the filename and line number, as well as the line itself where the trigger was met. Then it moves on to the next.

perl -ni.bak -e 'print if /S1901/' Subject_Census*.csv

Related

Counting number of lines with conditions

This is my script count.pl, I am trying to count the number of lines in a file.
The script's code :
chdir $filepath;
if (-e "$filepath"){
$total = `wc -l < file.list`;
printf "there are $total number of lines in file.list";
}
i can get a correct output, but i do not want to count blank lines and anything in the file that start with #. any idea ?
As this is a Perl program already open the file and read it, filtering out lines that don't count with
open my $fh, '<', $filename or die "Can't open $filename: $!";
my $num_lines = grep { not /^$|^\s*#/ } <$fh>;
where $filename is "file.list." If by "blank lines" you mean also lines with spaces only then chagne regex to /^\s*$|^\s*#/. See grep, and perlretut for regex used in its condition.
That filehandle $fh gets closed when the control exits the current scope, or add close $fh; after the file isn't needed for processing any more. Or, wrap it in a block with do
my $num_lines = do {
open my $fh, '<', $filename or die "Can't open $filename: $!";
grep { not /^$|^\s*#/ } <$fh>;
};
This makes sense doing if the sole purpose of opening that file is counting lines.
Another thing though: an operation like chdir should always be checked, and then there is no need for the race-sensitive if (-e $filepath) either. Altogether
# Perhaps save the old cwd first so to be able to return to it later
#my $old_cwd = Cwd::cwd;
chdir $filepath or die "Can't chdir to $filepath: $!";
open my $fh, '<', $filename or die "Can't open $filename: $!";
my $num_lines = grep { not /^$|^\s*#/ } <$fh>;
A couple of other notes:
There is no reason for printf. For all normal prints use say, for which you need use feature qw(say); at the beginning of the program. See feature pragma
Just in case, allow me to add: every program must have at the beginning
use warnings;
use strict;
Perhaps the original intent of the code in the question is to allow a program to try a non-existing location, and not die? In any case, one way to keep the -e test, as asked for
#my $old_cwd = Cwd::cwd;
chdir $filepath or warn "Can't chdir to $filepath: $!";
my $num_lines;
if (-e $filepath) {
open my $fh, '<', $filename or die "Can't open $filename: $!";
$num_lines = grep { not /^$|^\s*#/ } <$fh>;
}
where I still added a warning if chdir fails. Remove that if you really don't want it. I also added a declaration of the variable that is assigned the number of lines, with my $total_lines;. If it is declared earlier in your real code then of course remove that line here.
perl -ne '$n++ unless /^$|^#/ or eof; print "$n\n" if eof'
Works with multiple files too.
perl -ne '$n++ unless /^$|^#/ or eof; END {print "$n\n"}'
Better for a single file.
open(my $fh, '<', $filename);
my $n = 0;
for(<$fh>) { $n++ unless /^$|^#/}
print $n;
Using sed to filter out the "unwanted" lines in a single file:
sed '/^\s*#/d;/^\s*$/d' infile | wc -l
Obviously, you can also replace infile with a list of files.
The solution is very simple, no any magic.
use strict;
use warnings;
use feature 'say';
my $count = 0;
while( <> ) {
$count++ unless /^\s*$|^\s*#/;
}
say "Total $count lines";
Reference:
<>

Open two text files, process them and write to separate files

I'm using with Perl to open two text files, process them and then write the output to another file.
I have a file INPUT were every line is a customer. I will process each line into variables that will be used to substitute text in another file, TEMP. The result should be written into individual files for each customer, OUTPUT.
My program seems to be working on only the first file. The rest of the files remain empty with no output.
#!/usr/bin/perl -w
if ( $#ARGV < 0) {
print "Usage: proj5.pl <mm/dd/yyyy>\n";
exit;
}
my $date = $ARGV[0];
open(INFO, "p5Customer.txt") or die("Could not open p5Customer.txt file\n");
open(TEMP, "template.txt") or die("Could not open template.txt file\n");
my $directory = "Emails";
mkdir $directory unless(-e $directory);
foreach $info (<INFO>){
($email, $fullname, $title, $payed, $owed) = split /,/, $info;
next if($owed < $payed);
chomp($owed);
$filepath = "$directory/$email";
unless(open OUTPUT, '>>'.$filepath){
die "Unable to create '$filepath'\n";
}
foreach $detail (<TEMP>){
$detail =~ s/EMAIL/$email/g;
$detail =~ s/(NAME|FULLNAME)/$fullname/g;
$detail =~ s/TITLE/$title/g;
$detail =~ s/AMOUNT/$owed/g;
$detail =~ s{DATE}{$date}g;
print OUTPUT $detail;
}
close(OUTPUT);
}
close(INFO);
close(TEMP);
As has been said, you need to open your template file again each time you read from it. There's a bunch of other issues with your code too
Always use strict and use warnings 'all' and declare every variable with my as close as possible to where it is first used
$#ARGV is the index of the last element of #ARGV, so $#ARGV < 0 is much better written as #ARGV < 1
You should use lexical file handles, and the three-parameter form of open, so open(INFO, "p5Customer.txt") should be open my $info_fh, '<', "p5Customer.txt"
You should use while instead of for to read from a file
It is easier to use the default variable $_ for short loops
It is pointless to capture a substring in a regular expression if you're not going to use it, so (NAME|FULLNAME) should be NAME|FULLNAME
There is no point in closing input files before the end of your program
It is also much better to use an existing template system, such as
Template::Toolkit
This should work for you
#!/usr/bin/perl
use strict;
use warnings 'all';
if ( #ARGV < 1 ) {
print "Usage: proj5.pl <mm/dd/yyyy>\n";
exit;
}
my $date = $ARGV[0];
open my $info_fh, '<', 'p5Customer.txt' or die qq{Could not open "p5Customer.txt" file: $!};
my $directory = "Emails";
mkdir $directory unless -e $directory;
while ( <$info_fh> ) {
chomp;
my ($email, $fullname, $title, $payed, $owed) = split /,/;
next if $owed < $payed;
open my $template_fh, '<', 'template.txt' or die qq{Could not open "template.txt" file: $!};
my $filepath = "$directory/$email";
open my $out_fh, '>', $filepath or die qq{Unable to create "$filepath": $!};
while ( <$template_fh> ) {
s/EMAIL/$email/g;
s/FULLNAME|NAME/$fullname/g;
s/TITLE/$title/g;
s/AMOUNT/$owed/g;
s/DATE/$date/g;
print $out_fh $_;
}
close($out_fh);
}
Your problem is that the TEMP loop is inside the INPUT loop and so the TEMP loop will end while the INPUT loop is still on the first line of the INPUT file.
Best to store TEMP file data into a hash table and work on the TEMP hash table inside the INPUT loop.
Good luck.

Search string with multiple words in the pattern

My program is trying to search a string from multiple files in a directory. The code searches for single patterns like perl but fails to search a long string like Status Code 1.
Can you please let me know how to search for strings with multiple words?
#!/usr/bin/perl
my #list = `find /home/ad -type f -mtime -1`;
# printf("Lsit is $list[1]\n");
foreach (#list) {
# print("Now is : $_");
open(FILE, $_);
$_ = <FILE>;
close(FILE);
unless ($_ =~ /perl/) { # works, but fails to find string "Status Code 1"
print "found\n";
my $filename = 'report.txt';
open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
say $fh "My first report generated by perl";
close $fh;
} # end unless
} # end For
There are a number of problems with your code
You must always use strict and use warnings at the top of every Perl program. There is little point in delcaring anything with my without strict in place
The lines returned by the find command will have a newline at the end which must be removed before Perl can find the files
You should use lexical file handles (my $fh instead of FILE) and the three-parameter form of open as you do with your output file
$_ = <FILE> reads only the first line of the file into $_
unless ($_ =~ /perl/) is inverted logic, and there's no need to specify $_ as it is the default. You should write if ( /perl/ )
You can't use say unless you have use feature 'say' at the top of your program (or use 5.010, which adds all features available in Perl v5.10)
It is also best to avoid using shell commands as Perl is more than able to do anything that you can using command line utilities. In this case -f $file is a test that returns true if the file is a plain file, and -M $file returns the (floating point) number of days since the file's modification time
This is how I would write your program
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
for my $file ( glob '/home/ad/*' ) {
next unless -f $file and int(-M $file) == 1;
open my $fh, '<', $file or die $!;
while ( <$fh> ) {
if ( /perl/ ) {
print "found\n";
my $filename = 'report.txt';
open my $out_fh, '>>', $filename or die "Could not open file '$filename': $!";
say $fh "My first report generated by perl";
close $out_fh;
last;
}
}
}
it should have matched unless $_ contains text in different case.
try this.
unless($_ =~ /Status\s+Code\s+1/i) {
Change
unless ($_ =~ /perl/) {
to:
unless ($_ =~ /(Status Code 1)/) {
I am certain the above works, except it's case sensitive.
Since you question it, I rewrote your script to make more sense of what you're trying to accomplish and implement the above suggestion. Correct me if I am wrong, but you're trying to make a script which matches "Status Code 1" in a bunch of files where last modified within 1 day and print the filename to a text file.
Anyways, below is what I recommend:
#!/usr/bin/perl
use strict;
use warnings;
my $output_file = 'report.txt';
my #list = `find /home/ad -type f -mtime -1`;
foreach my $filename (#list) {
print "PROCESSING: $filename";
open (INCOMING, "<$filename") || die "FATAL: Could not open '$filename' $!";
foreach my $line (<INCOMING>) {
if ($line =~ /(Status Code 1)/) {
open( FILE, ">>$output_file") or die "FATAL: Could not open '$output_file' $!";
print FILE sprintf ("%s\n", $filename);
close(FILE) || die "FATAL: Could not CLOSE '$output_file' $!";
# Bail when we get the first match
last;
}
}
close(INCOMING) || die "FATAL: Could not close '$filename' $!";
}

splitting a large file into small files based on column value in perl

I am trying to split up a large file (having around 17.6 million data) into 6-7 small files based on the column value.Currently, I am using sql bcp utility to dump in all data into one table and creating seperate files using bcp out utility.
But someone suggested me to use Perl as it would be more faster and you don't need to create a table for that.As I am not a perl guy. I am not sure how to do it in perl.
Any help..
INPUT file :
inputfile.txt
0010|name|address|city|.........
0020|name|number|address|......
0030|phone no|state|street|...
output files:
0010.txt
0010|name|address|city|.........
0020.txt
0020|name|number|address|......
0030.txt
0030|phone no|state|street|...
It is simplest to keep a hash of output file handles, keyed by the file name. This program shows the idea. The number at the start of each record is used to create the name of the file where it belongs, and file of that name is opened unless we already have a file handle for it.
All of the handles are closed once all of the data has been processed. Any errors are caught by use autodie, so explicit checking of the open, print and close calls is unnecessary.
use strict;
use warnings;
use autodie;
open my $in_fh, '<', 'inputfile.txt';
my %out_fh;
while (<$in_fh>) {
next unless /^(\d+)/;
my $filename = "$1.txt";
open $out_fh{$filename}, '>', $filename unless $out_fh{$filename};
print { $out_fh{$filename} } $_;
}
close $_ for values %out_fh;
Note close caught me out here because, unlike most operators that work on $_ if you pass no parameters, a bare close will close the currently selected file handle. That is a bad choice IMO, but it's way to late to change it now
17.6 million rows is going to be a pretty large file, I'd imagine. It'll still be slow with perl to process.
That said, you're going to want something like the below:
use strict;
use warnings;
my $input = 'FILENAMEHERE.txt';
my %results;
open(my $fh, '<', $input) or die "cannot open input file: $!";
while (<$fh>) {
my ($key) = split '|', $_;
my $array = $results{$key} || [];
push $array, $_;
$results{$key} = $array;
}
for my $filename (keys %results) {
open(my $out, '>', "$filename.txt") or die "Cannot open output file $out: $!";
print $out, join "\n", $results{$filename};
close($out);
}
I haven't explicitly tested this, but it should get you going in the right direction.
$ perl -F'|' -lane '
$key = $F[0];
$fh{$key} or open $fh{$key}, ">", "$key.txt" or die $!;
print { $fh{$key} } $_
' inputfile.txt
perl -Mautodie -ne'
sub out { $h{$_[0]} ||= open(my $f, ">", "$_[0].txt") && $f }
print { out($1) } $_ if /^(\d+)/;
' file

How to search and replace using hash with Perl

I'm new to Perl and I'm afraid I am stuck and wanted to ask if someone might be able to help me.
I have a file with two columns (tab separated) of oldname and newname.
I would like to use the oldname as key and newname as value and store it as a hash.
Then I would like to open a different file (gff file) and replace all the oldnames in there with the newnames and write it to another file.
I have given it my best try but am getting a lot of errors.
If you could let me know what I am doing wrong, I would greatly appreciate it.
Here are how the two files look:
oldname newname(SFXXXX) file:
genemark-scaffold00013-abinit-gene-0.18 SF130001
augustus-scaffold00013-abinit-gene-1.24 SF130002
genemark-scaffold00013-abinit-gene-1.65 SF130003
file to search and replace in (an example of one of the lines):
scaffold00013 maker gene 258253 258759 . - . ID=maker-scaffold00013-augustus-gene-2.187;Name=maker-scaffold00013-augustus-gene-2.187;
Here is my attempt:
#!/usr/local/bin/perl
use warnings;
use strict;
my $hashfile = $ARGV[0];
my $gfffile = $ARGV[1];
my %names;
my $oldname;
my $newname;
if (!defined $hashfile) {
die "Usage: $0 hash_file gff_file\n";
}
if (!defined $gfffile) {
die "Usage: $0 hash_file gff_file\n";
}
###save hashfile with two columns, oldname and newname, into a hash with oldname as key and newname as value.
open(HFILE, $hashfile) or die "Cannot open $hashfile\n";
while (my $line = <HFILE>) {
chomp($line);
my ($oldname, $newname) = split /\t/;
$names{$oldname} = $newname;
}
close HFILE;
###open gff file and replace all oldnames with newnames from %names.
open(GFILE, $gfffile) or die "Cannot open $gfffile\n";
while (my $line2 = <GFILE>) {
chomp($line2);
eval "$line2 =~ s/$oldname/$names{oldname}/g";
open(OUT, ">SFrenamed.gff") or die "Cannot open SFrenamed.gff: $!";
print OUT "$line2\n";
close OUT;
}
close GFILE;
Thank you!
Your main problem is that you aren't splitting the $line variable. split /\t/ splits $_ by default, and you haven't put anything in there.
This program builds the hash, and then constructs a regex from all the keys by sorting them in descending order of length and joining them with the | regex alternation operator. The sorting is necessary so that the longest of all possible choices is selected if there are any alternatives.
Every occurrence of the regex is replaced by the corresponding new name in each line of the input file, and the output written to the new file.
use strict;
use warnings;
die "Usage: $0 hash_file gff_file\n" if #ARGV < 2;
my ($hashfile, $gfffile) = #ARGV;
open(my $hfile, '<', $hashfile) or die "Cannot open $hashfile: $!";
my %names;
while (my $line = <$hfile>) {
chomp($line);
my ($oldname, $newname) = split /\t/, $line;
$names{$oldname} = $newname;
}
close $hfile;
my $regex = join '|', sort { length $b <=> length $a } keys %names;
$regex = qr/$regex/;
open(my $gfile, '<', $gfffile) or die "Cannot open $gfffile: $!";
open(my $out, '>', 'SFrenamed.gff') or die "Cannot open SFrenamed.gff: $!";
while (my $line = <$gfile>) {
chomp($line);
$line =~ s/($regex)/$names{$1}/g;
print $out $line, "\n";
}
close $out;
close $gfile;
Why are you using an eval? And $oldname is going to be undefined in the second while loop, because the first while loop you redeclare them in that scope (even if you used the outer scope, it would store the very last value that you processed, which wouldn't be helpful).
Take out the my $oldname and my $newname at the top of your script, it is useless.
Take out the entire eval line. You need to repeat the regex for each thing you want to replace. Try something like:
$line2 =~ s/$_/$names{$_}/g for keys %names;
Also see Borodin's answer. He made one big regex instead of a loop, and caught your lack of the second argument to split.