Removing rows where the number of interest is below cut-off (Perl)

I have files with several columns that contain text in the following format “number/number:zero,number_of_interest”. Example: “1/1:0,13”.
I need to remove rows if the number of interest is less than 20 in any of the columns.
I would prefer to use egrep rather than reading the file in the Perl script, but I'm not sure how to evaluate the number of interest in each column in a single statement.
I’m also removing rows that contain SVLEN=-1 or SVLEN=-2 and it seems to be working well with egrep:
$cmd2 = `egrep -v 'SVLEN=-1;|SVLEN=-2;' $my_vcf > $my_new_vcf`; print $cmd1;
I've tried the following but it did not work:
my $cmd2 = `egrep -v 'SVLEN=-1;|SVLEN=-2;|(\,(\d+) < 20)' $my_vcf > $my_new_vcf`; print $cmd2;
Thank you.

egrep is the wrong tool for this purpose, as it cannot do arithmetic comparisons within its regular expressions.
Because you already have a Perl script, you are better off using Perl itself to achieve your goal.
Unfortunately you have to open and read the file line by line to do so, but that is exactly what egrep does too. How else could it judge the lines?
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util;
sub filter_lines
{
    my $in_filename  = shift;
    my $out_filename = shift;
    open( my $fhin,  '<', $in_filename )  or die "cannot open $in_filename: $!\n";
    open( my $fhout, '>', $out_filename ) or die "cannot open $out_filename: $!\n";
    while ( my $line = <$fhin> ) {
        next if ( $line =~ /SVLEN=-1;|SVLEN=-2;/ );
        if ( my @numbers_of_interest = ( $line =~ m/\d+\/\d+:0,(\d+)/g ) ) {
            # drop the row if any number of interest is below 20
            next if List::Util::min(@numbers_of_interest) < 20;
        }
        print $fhout $line;
    }
    close($fhin);
    close($fhout);
}
filter_lines( $my_vcf, $my_new_vcf );
Because I have no exact input line, the pattern that captures @numbers_of_interest might be a bit inaccurate and need improvement. There's also much room for optimization in case this code turns out to be slow.

Related

Search for multiple terms in perl

I have a file with more than a hundred single-column entries. I need to search for each of these entries in a file with multiple columns and more than a thousand entries, and write the matches to an output file. I tried this code:
#!/usr/bin/perl -w
use strict;
use warnings;
print "Enter the input file name:";
my $inputfile = <STDIN>;
chomp($inputfile);
print "\nEnter the search file name:";
my $searchfile=<STDIN>;
chomp($searchfile);
open (INPUTFILE, $inputfile) || die;
open (SEARCHFILE, $searchfile) || die;
open (OUT, ">write.txt") || die;
while (my $line=<SEARCHFILE>){
while (<INPUTFILE>) {
if (/$line/){
print OUT $_;
}
}
}
close (INPUTFILE) || die;
close (SEARCHFILE) || die;
close (OUT) || die;
The output file has only one line. It has searched for the term from the search file in the input file, but only for the first term, not for all of them. Please help!
When you read INPUTFILE in the inner loop, it's read to the end during the first round of SEARCHFILE. Because it's not reset, the filehandle is used up and will always return eof.
If there are hundreds of lines, but not several 100,000, you can easily read it into an array first and then use that for the lookup. The fact that it's single-column makes that very easy. Note that this is less efficient than the alternative solution below.
chomp( my @needles = <SEARCHFILE> );
while (<INPUTFILE>) {
    foreach my $needle (@needles) {
        print OUT $_ if m/\Q$needle\E/; # \Q and \E quote regex metacharacters
    }
}
Alternatively you can also build one large lookup regex that matches all the strings in one go. That is probably faster than iterating the array for each line.
# open ...
chomp( my @needles = <SEARCHFILE> );
my $lookup = join '|', map quotemeta, @needles;
my $lookup_regex = qr/$lookup/; # possibly with /i?
while (my $line = <INPUTFILE>) {
    print OUT $line if $line =~ $lookup_regex;
}
The quotemeta takes care of strings that contain regex metacharacters like /, |, or even the dot. It's the same as using \Q and \E as above.
Please also use the three-argument open and lexical filehandles.
open my $fh_searchfile, '<', $searchfile or die $!;
open my $fh_inputfile, '<', $inputfile or die $!;
open my $fh_out, '>', 'write.txt' or die $!;
chomp( my @needles = <$fh_searchfile> );
# ...
The three-argument open is important because you are taking user input and using it as the filename directly. With the two-argument form, a malicious user could enter something like | rm -rf *, which would open a pipe to a delete-all-my-files-without-asking program. Oops. But if you specify the '<' read mode explicitly as its own argument, mode characters in the filename argument are not interpreted.
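To make the difference concrete, here is a minimal sketch (the hostile input string is a made-up example):
use strict;
use warnings;

my $user_input = '| rm -rf *';    # hypothetical hostile input

# Two-argument open would interpret the leading '|' and run the command:
# open my $bad_fh, $user_input or die $!;               # dangerous

# Three-argument open treats the string purely as a (strange) filename:
open my $fh, '<', $user_input
    or die "cannot open '$user_input': $!\n";            # fails harmlessly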
The lexical filehandle $fh is, as the name says, lexical, while INPUTFILE is a GLOB, which makes it global. That's not so bad if you only have this one script and no modules, but as soon as you deal with different packages it becomes problematic because those are super-global and every part of the program sees them. That can lead to name collisions and weird stuff happening.
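As a small illustration (the sub names here are made up), a bareword handle is shared by the whole program, while a lexical one disappears at the end of its scope:
sub read_first_line_global {
    my ($file) = @_;
    open FH, '<', $file or die $!;        # FH is package-wide; any other code using FH clashes
    my $line = <FH>;
    close FH;
    return $line;
}

sub read_first_line_lexical {
    my ($file) = @_;
    open my $fh, '<', $file or die $!;    # $fh is private to this sub and is closed when it goes out of scope
    return scalar <$fh>;
}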

Perl modifying CSV files

I have a small section of code I'm trying to modify.
What I'm trying to do is have the filename inserted into the third column. At the moment I almost have it working, but I'd like to remove the ".csv"s from the end of each entry in the column. I'd also like to give the column the heading "filename".
I hope the difference between "table1" and "table2" shown above summarises quite well the modification which I'm trying to make here.
The code I'm currently using to create "table1" is the following:
#!/usr/bin/perl
use warnings;
use strict;
open M,"<mapcodelist.txt" or die "mapcodelist.txt $!";
my %m;
while( <M> ){
my($k,$v)=split;
$v=~s/\./_/g;
$m{$k}=$v;
}
close M;
chdir "C:/Users/Stephen/Desktop/Database_Design/" or die $!;
@ARGV=<*.csv>;
$^I=".bak";
while( <> ){
chomp;
$\=/^mass/?",filename$/": ",$ARGV$/";
print;
}
for( <*.csv> ){
my $r;
($r=$_) =~ s/\w+_(\w+)(?=\.csv)/$1_$m{$1}/;
rename $_,$r or warn " rename $_,$r $!";
}
Any advice with this would be very much appreciated.
Thanks.
You can try the following Perl script:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
my ($prev_lc);
open my $fh, '<', shift or die;
my $csv = Text::CSV_XS->new({ eol => "\n" }) or die;
while ( my $row = $csv->getline($fh) ) {
    if ( $csv->record_number == 1 ) {
        $prev_lc = $row->[$#$row];
        $csv->print( \*STDOUT, [ @$row[0 .. $#$row - 1], 'Filename' ] );
        next;
    }
    $prev_lc =~ s/\.csv$//;
    $csv->print( \*STDOUT, [ @$row[0 .. $#$row - 1], $prev_lc ] );
    ## Previous last column.
    $prev_lc = $row->[$#$row];
}
It uses an auxiliary variable to add the missing header and process each whole data line at the same time. I simply use a regular expression to remove the extension.
With the following dummy test data (infile), and assuming the last line doesn't have a file name because of the header:
mass,intensity,20130730_p12_A2.csv
2349.345,56.23423,20130730_p12_A2.csv
744.2884,5.01
Run the script like:
perl script.pl infile
That yields:
mass,intensity,Filename
2349.345,56.23423,20130730_p12_A2
744.2884,20130730_p12_A2
Perhaps it's not perfect for particular data that you didn't show, and I didn't take into account all the code you posted for handling many files. But you can see that it works the way you asked, and adapting it to your needs, if necessary, is left as work for you.

Getting unique random line (at each script run) from a text file with Perl

Given a text file like the following, called "input.txt":
some field1a | field1b | field1c
...another approx 1000 lines....
fielaNa | field Nb | field Nc
I can choose any field delimiter.
I need a script that, at each discrete run, gets one unique (never repeated) random line from this file, until all lines are used.
My solution: I added one column to the file, so I have
0|some field1a | field1b | field1c
...another approx 1000 lines....
0|fielaNa | field Nb | field Nc
and process it with the following code:
use 5.014;
use warnings;
use utf8;
use List::Util;
use open qw(:std :utf8);
my $file = "./input.txt";
#read all lines into array and shuffle them
open(my $fh, "<:utf8", $file);
my @lines = List::Util::shuffle map { chomp $_; $_ } <$fh>;
close $fh;
#search for the 1st line what has 0 at the start
#change the 0 to 1
#and rewrite the whole file
my $random_line;
for(my $i=0; $i<=$#lines; $i++) {
if( $lines[$i] =~ /^0/ ) {
$random_line = $lines[$i];
$lines[$i] =~ s/^0/1/;
open($fh, ">:utf8", $file);
print $fh join("\n", @lines);
close $fh;
last;
}
}
$random_line = "1|NO|more|lines" unless( $random_line =~ /\w/ );
do_something_with_the_fields( split /\|/, $random_line );
exit;
It is a working solution, but not a very nice one, because:
the line order changes at each script run
it is not safe for concurrent script runs.
How can I write it more effectively and elegantly?
What about keeping a shuffled list of the line numbers in a different file, removing the first one each time you use it? Some locking might be needed to assure concurrent script-run safety.
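A minimal sketch of that idea, assuming the line numbers live in a separate indices.txt next to input.txt and that an exclusive flock is enough locking for your use case:
use strict;
use warnings;
use Fcntl qw(:flock);
use List::Util qw(shuffle);

my ($data_file, $index_file) = ('input.txt', 'indices.txt');

my $first_run = !-e $index_file;
# open (or create) the index file read-write and take an exclusive lock
open my $idx_fh, '+>>', $index_file or die "cannot open $index_file: $!";
flock $idx_fh, LOCK_EX or die "cannot lock $index_file: $!";
seek $idx_fh, 0, 0;                      # '+>>' opens at the end, so rewind before reading
chomp( my @indices = <$idx_fh> );

if ($first_run) {                        # build the shuffled list of line numbers once
    open my $data_fh, '<', $data_file or die "cannot open $data_file: $!";
    1 while <$data_fh>;                  # count the lines; $. holds the total
    @indices = shuffle( 0 .. $. - 1 );
    close $data_fh;
}
die "all lines have been used\n" unless @indices;

my $pick = shift @indices;               # take (and discard) the first shuffled index

# rewrite the index file without the used entry
# (in append mode writes go to the end of file, which is offset 0 after the truncate)
seek $idx_fh, 0, 0;
truncate $idx_fh, 0;
print {$idx_fh} "$_\n" for @indices;
close $idx_fh;                           # releases the lock

# fetch the chosen line from the data file
open my $data_fh, '<', $data_file or die "cannot open $data_file: $!";
my $random_line;
while (<$data_fh>) {
    if ($. == $pick + 1) { $random_line = $_; last }
}
close $data_fh;
print $random_line;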
From perlfaq5.
How do I select a random line from a file?
Short of loading the file into a database or pre-indexing the lines in
the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
srand;
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file
in. You can find a proof of this method in The Art of Computer
Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
You can use the File::Random module which provides a function for that
algorithm:
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire
file as an array. Simply access a random array element.
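For example, a minimal sketch of that Tie::File approach, assuming the file name input.txt:
use strict;
use warnings;
use Tie::File;

# treat the file as an array and pick a random element
tie my @lines, 'Tie::File', 'input.txt'
    or die "Unable to open input.txt: $!";
print $lines[ int rand @lines ], "\n";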
All Perl programmers should take the time to read the FAQ.
Update: To get a unique random line each time you're going to have to store state. The easiest way to store the state is to remove the lines that you've used from the file.
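Here is a minimal sketch of that line-removal idea, assuming it is acceptable to modify input.txt in place (the programs below keep input.txt intact and store the state in a separate indices file instead):
use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', 'input.txt'
    or die qq(Unable to open "input.txt": $!);
die "All records have been used\n" unless @lines;

# pick a random record, print it, then remove it from the file
my $i = int rand @lines;
print $lines[$i], "\n";
splice @lines, $i, 1;
untie @lines;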
This program uses the Tie::File module to open your input.txt file as well as an indices.txt file.
If indices.txt is empty then it is initialised with the indices of all the records in input.txt in a shuffled order.
Each run, the index at the end of the list is removed and the corresponding input record displayed.
use strict;
use warnings;
use Tie::File;
use List::Util 'shuffle';
tie my @input, 'Tie::File', 'input.txt'
    or die qq(Unable to open "input.txt": $!);
tie my @indices, 'Tie::File', 'indices.txt'
    or die qq(Unable to open "indices.txt": $!);
@indices = shuffle(0..$#input) unless @indices;
my $index = pop @indices;
print $input[$index];
Update
I have modified this solution so that it populates a new indices.txt file only if it doesn't already exist and not, as before, simply when it is empty. That means a new sequence of records can be printed simply by deleting the indices.txt file.
use strict;
use warnings;
use Tie::File;
use List::Util 'shuffle';
my ($input_file, $indices_file) = qw( input.txt indices.txt );
tie my @input, 'Tie::File', $input_file
    or die qq(Unable to open "$input_file": $!);
my $first_run = not -f $indices_file;
tie my @indices, 'Tie::File', $indices_file
    or die qq(Unable to open "$indices_file": $!);
@indices = shuffle(0..$#input) if $first_run;
@indices or die "All records have been displayed";
my $index = pop @indices;
print $input[$index];

Read the last line of file with data in Perl

I have a text file to parse in Perl. I parse it from the start of file and get the data that is needed.
After all that is done I want to read the last line in the file with data. The problem is that the last two lines are blank. So how do I get the last line that holds any data?
If the file is relatively short, just read on from where you finished getting the data, keeping the last non-blank line:
use autodie ':io';
open(my $fh, '<', 'file_to_read.txt');
# get the data that is needed, then:
my $last_non_blank_line;
while (my $line = readline $fh) {
# choose one of the following two lines, depending what you meant
if ( $line =~ /\S/ ) { $last_non_blank_line = $line } # line isn't all whitespace
# if ( $line !~ /^$/ ) { $last_non_blank_line = $line } # line has characters before the newline
}
If the file is longer, or you may have passed the last non-blank line in your initial data gathering step, reopen it and read from the end:
use File::ReadBackwards;
my $backwards = File::ReadBackwards->new( 'file_to_read.txt' );
my $last_non_blank_line;
do {
$last_non_blank_line = $backwards->readline;
} until ! defined $last_non_blank_line || $last_non_blank_line =~ /\S/;
perl -e 'while (<>) { $last = $_ if /\S/ } print $last;' < my_file.txt
You can use the module File::ReadBackwards in the following way:
use File::ReadBackwards;
my $bw = File::ReadBackwards->new('filepath')
    or die "can't read file";
while ( defined( my $log_line = $bw->readline ) ) {
    print $log_line;
    exit 0;
}
If the last lines may be blank, just check that $log_line contains more than a newline before printing.
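For instance, a minimal variation of the loop above that skips blank or whitespace-only lines (same placeholder 'filepath'):
use File::ReadBackwards;

my $bw = File::ReadBackwards->new('filepath')
    or die "can't read file";
while ( defined( my $log_line = $bw->readline ) ) {
    next unless $log_line =~ /\S/;    # skip blank and whitespace-only lines
    print $log_line;
    last;                             # this is the last line that holds any data
}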
If the file is small, I would store it in an array and read from the end. If it's large, use the File::ReadBackwards module.
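A minimal sketch of that array approach, assuming the file is called file_to_read.txt:
use strict;
use warnings;

open my $fh, '<', 'file_to_read.txt' or die "cannot open file: $!";
my @lines = <$fh>;
close $fh;

# walk backwards and keep the first line that contains non-whitespace
my ($last_line) = grep { /\S/ } reverse @lines;
print $last_line if defined $last_line;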
Here's my variant of a command-line Perl solution:
perl -ne 'END {print $last} $last= $_ if /\S/' file.txt
No one mentioned Path::Tiny. If the file size is relatively small you can do this:
use Path::Tiny;
my $file = path($file_name);
my ($last_line) = $file->lines({count => -1});
CPAN page.
Just remember that for a large file, as @ysth said, it's better to use File::ReadBackwards. The difference can be substantial.
Sometimes it is more comfortable for me to run shell commands from Perl code, so I'd prefer the following to solve this case:
$result=`tail -n 1 /path/file`;

Comparing lines in a file with perl

I've been trying to compare lines between two files and match lines that are the same.
For some reason the code below only ever goes through the first line of 'text1.txt' and prints the 'if' statement regardless of whether the two variables match or not.
Thanks
use strict;
open( <FILE1>, "<text1.txt" );
open( <FILE2>, "<text2.txt" );
foreach my $first_file (<FILE1>) {
foreach my $second_file (<FILE2>) {
if ( $second_file == $first_file ) {
print "Got a match - $second_file + $first_file";
}
}
}
close(FILE1);
close(FILE2);
If you compare strings, use the eq operator. "==" compares arguments numerically.
Here is a way to do the job if your files aren't too large.
#!/usr/bin/perl
use Modern::Perl;
use File::Slurp qw(slurp);
use Array::Utils qw(:all);
use Data::Dumper;
# read entire files into arrays
my @file1 = slurp('file1');
my @file2 = slurp('file2');
# get the common lines from the 2 files
my @intersect = intersect(@file1, @file2);
say Dumper \@intersect;
A better and faster (but less memory efficient) approach would be to read one file into a hash, and then search for lines in the hash table. This way you go over each file only once.
# This will find matching lines in two files,
# print the matching line and it's line number in each file.
use strict;
open (FILE1, "<text1.txt") or die "can't open file text1.txt\n";
my %file_1_hash;
my $line;
my $line_counter = 0;
#read the 1st file into a hash
while ($line=<FILE1>){
chomp ($line); # only if you want to get rid of the newline
$line_counter++;
if (!($line =~ m/^\s*$/)){
$file_1_hash{$line}=$line_counter;
}
}
close (FILE1);
#read and compare the second file
open (FILE2,"<text2.txt") or die "can't open file text2.txt\n";
$line_counter = 0;
while ($line=<FILE2>){
$line_counter++;
chomp ($line);
if (defined $file_1_hash{$line}){
print "Got a match: \"$line\"
in line #$line_counter in text2.txt and line #$file_1_hash{$line} at text1.txt\n";
}
}
close (FILE2);
You must re-open FILE2 or reset its file pointer. Move the open and close commands to within the loop.
A more efficient way of doing this, depending on file and line sizes, would be to only loop through the files once and save each line that occurs in file 1 in a hash. Then check if the line was there for each line in file 2.
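A minimal sketch of the reset approach, applied to the question's loops (seek rewinds the second file for every line of the first; the eq fix from the other answer is included):
use strict;
use warnings;

open my $fh1, '<', 'text1.txt' or die "can't open text1.txt: $!";
open my $fh2, '<', 'text2.txt' or die "can't open text2.txt: $!";

while ( my $first = <$fh1> ) {
    seek $fh2, 0, 0;                  # rewind the second file for every line of the first
    while ( my $second = <$fh2> ) {
        if ( $first eq $second ) {
            print "Got a match - $second";
        }
    }
}
close $fh1;
close $fh2;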
If you want the number of lines,
my $count=`grep -f [FILE1PATH] -c [FILE2PATH]`;
If you want the matching lines,
my #lines=`grep -f [FILE1PATH] [FILE2PATH]`;
If you want the lines which do not match,
my #lines = `grep -f [FILE1PATH] -v [FILE2PATH]`;
This is a script I wrote that tries to see if two files are identical, although it could easily be modified by playing with the code and switching ne to eq. As Tim suggested, using a hash would probably be more effective, although you couldn't ensure the files were being compared in the order they were inserted without using a CPAN module (and as you can see, this method should really use two loops, but it was sufficient for my purposes). This isn't exactly the greatest script ever, but it may give you somewhere to start.
use warnings;
open (FILE, "orig.txt") or die "Unable to open first file.\n";
my @data1 = <FILE>;
close(FILE);
open (FILE, "2.txt") or die "Unable to open second file.\n";
my @data2 = <FILE>;
close(FILE);
for (my $i = 0; $i < @data1; $i++) {
    $data1[$i] =~ s/\s+$//;
    $data2[$i] =~ s/\s+$//;
    if ($data1[$i] ne $data2[$i]) {
        print "Failure to match at line " . ($i + 1) . "\n";
        print $data1[$i];
        print "Doesn't match:\n";
        print $data2[$i];
        print "\nProgram Aborted!\n";
        exit;
    }
}
print "\nThe files are identical. \n";
Taking the code you posted, and transforming it into actual Perl code, this is what I came up with.
use strict;
use warnings;
use autodie;
open my $fh1, '<', 'text1.txt';
open my $fh2, '<', 'text2.txt';
while(
defined( my $line1 = <$fh1> )
and
defined( my $line2 = <$fh2> )
){
chomp $line1;
chomp $line2;
if( $line1 eq $line2 ){
print "Got a match - $line1\n";
}else{
print "Lines don't match $line1 $line2"
}
}
close $fh1;
close $fh2;
Now what you may really want is a diff of the two files, which is best left to Text::Diff.
use strict;
use warnings;
use Text::Diff;
print diff 'text1.txt', 'text2.txt';