Performing a one-liner on multiple input files specified by extension - perl

I'm using the following line to split and process a tab-delimited .txt file:
perl -lane 'next unless $. >30; @array = split /[:,\/]+/, $F[2]; print if $array[1]/$array[2] >0.5 && $array[4] >2' input.txt > output.txt
Is there a way to alter this one-liner so that it runs on multiple input files without specifying each one individually?
Ideally it would run on all files in the current directory with the .txt (or another) extension and write a set of output files with modified names, e.g.:
Input:
test1.txt
test2.txt
Output:
test1MOD.txt
test2MOD.txt
I know that I can access the filename to modify it with $ARGV but I do not know how to go about getting it to run on multiple files.
Solution:
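perl -i.MOD -lane 'next unless $. >30; @array = split /[:,\/]+/, $F[2]; print if $array[1]/$array[2] >0.5 && $array[4] >2; close ARGV if eof;' *.txt
$. needs to be reset for each file, which is what close ARGV if eof does. Otherwise it keeps counting across files, the first 30 lines of later files are no longer skipped, and the script throws a division by zero error.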

If you don't mind a slightly different output file name, -i.MOD edits each file in place and keeps the original as a backup with .MOD appended (e.g. test1.txt.MOD):
perl -i.MOD -lane'
    next unless $. >30;
    @array = split /[:,\/]+/, $F[2];
    print if $array[1]/$array[2] >0.5 && $array[4] >2;
    close ARGV if eof; # Reset $. for each file.
' *.txt

Have you considered calling the perl script from a shell for loop?
for TXT in *.txt; do
    OUT=$(basename "$TXT" .txt)MOD.txt
    perl ... "$TXT" > "$OUT"
done

Related

Why is there a 0 on a new line when I print in perl?

I'm trying to get the inode alone of a file that is passed through as an argument.
When I extract the inode, however, a 0 is printed on a new line. I've tried to get rid of it with a regex but I can't. I'm passing the script /usr/bin/vim. The 0 isn't there when I run the command directly (ls -i /usr/bin/vim | awk '{print $1}'), but it is there when I run my script.
How can I get rid of this 0?
my $filepath = $ARGV[0];
chomp $filepath;
$filepath =~ s/\s//g;
print ("FILEPATH: $filepath\n"); #looks good
my $inode = system("ls -i $filepath | awk '{print \$1}'");
$inode =~ s/[^a-zA-Z0-9]//g;
$inode =~ s/\s//g;
print ("$inode\n");
So my result is
137699967
0
When you invoke system you run the command provided as its argument, and that's what's outputting the inode number.
The return value of system is the exit code of the command run, which in this case is 0, and that's what your subsequent print call outputs.
To run an external program and capture its output, use the qx operator, like so:
my $inode = qx/ls -i $filepath | awk '{print \$1}'/;
However, as Sobrique explained in their answer, you don't actually need to call an external program, you can use Perl's built-in stat function instead.
my $inode = (stat($filepath))[1];
stat returns a list containing a variety of information about a file; index 1 holds its inode. This code doesn't handle the case where the file doesn't exist, of course.
Don't shell out; just use the stat builtin instead:
print +(stat($filepath))[1], "\n";
print join("\n", map { (stat $_)[1] } @ARGV), "\n";

Getting error while replacing word using perl

I am writing a script for replacing 2 words from a text file. The script is
count=1
for f in *.pdf
do
filename="$(basename $f)"
filename="${filename%.*}"
filename="${filename//_/ }"
echo $filename
echo $f
perl -pe 's/intime_mean_pu.pdf/'$f'/' fig.tex > fig_$count.tex
perl -pi 's/TitleFrame/'$filename'/' fig_$count.tex
sed -i '/Pointer-rk/r fig_'$count'.tex' $1.tex
count=$((count+1))
done
But the replacement using the second perl command gives an error:
Can't open perl script "s/TitleFrame/Masses1/": No such file or directory
Please suggest what I am doing wrong.
You could change your script to something like this:
#!/bin/bash
count=1
for f in *.pdf; do
    filename=$(basename "$f" .pdf)
    filename=${filename//_/ }
    # -s turns the -a=... and -b=... arguments after -- into $a and $b
    perl -spe 's/intime_mean_pu.pdf/$a/;
               s/TitleFrame/$b/' < fig.tex -- -a="$f" -b="$filename" > "fig_$count.tex"
    sed -i "/Pointer-rk/r fig_$count.tex" "$1.tex"
    ((++count))
done
As well as some other minor changes to your script, I have made use of the -s switch to Perl, which means that you can pass arguments to the one-liner. The bash variables have been double quoted to avoid problems with spaces in filenames, etc.
Alternatively, you could do the whole thing in Perl:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use File::Basename;
my $file_arg = shift;
my $count = 1;
for my $f (glob "*.pdf") {
    my $name = fileparse($f, qq(.pdf));
    open my $in, "<", $file_arg;
    open my $out, ">", 'tmp';
    open my $fig, "<", 'fig.tex';
    # copy up to match
    while (<$in>) {
        print $out $_;
        last if /Pointer-rk/;
    }
    # insert contents of figure (with substitutions)
    while (<$fig>) {
        s/intime_mean_pu.pdf/$f/;
        s/TitleFrame/$name/;
        print $out $_;
    }
    # copy rest of file
    print $out $_ while <$in>;
    rename 'tmp', $file_arg;
    ++$count;
}
Use the script like perl script.pl "$1.tex".
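You're missing the -e switch in the second perl call (it should be perl -pi -e '...'), so Perl treats the substitution as the name of a script file.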

Does "print $ARGV" alter the argument array in any way?

Here is the example:
$a = shift;
$b = shift;
push(@ARGV,$b);
$c = <>;
print "\$b: $b\n";
print "\$c: $c\n";
print "\$ARGV: $ARGV\n";
print "\@ARGV: @ARGV\n";
And the output:
$b: file1
$c: dir3
$ARGV: file2
@ARGV: file3 file1
I don't understand what exactly is happening when printing $ARGV without any index. Does it print the first argument and then remove it from the array? Because I thought after all the statements the array becomes:
file2 file3 file1
Invocation:
perl port.pl -axt file1 file2 file3
file1 contains the lines:
dir1
dir2
file2:
dir3
dir4
dir5
file3:
dir6
dir7
Greg has quoted the appropriate documentation, so here's a quick rundown of what happens
$a = shift; # "-axt" is removed from @ARGV and assigned to $a
$b = shift; # "file1" likewise
push(@ARGV,$b); # "file1" inserted at end of @ARGV
$c = <>; # "file2" is removed from @ARGV, and its file
# handle opened, the first line of file2 is read
When the file handle for "file2" is opened, it sets the file name in $ARGV. As Greg mentioned, @ARGV and $ARGV are completely different variables.
The internal workings of the diamond operator <> are probably what is confusing you here, in that it does approximately $ARGV = shift @ARGV.
In Perl, $ARGV and #ARGV are completely different. From perlvar:
$ARGV
Contains the name of the current file when reading from <>.
@ARGV
The array @ARGV contains the command-line arguments intended for the script. $#ARGV is generally the number of arguments minus one, because $ARGV[0] is the first argument, not the program's command name itself. See $0 for the command name.
No, but <> does. <> is short for <ARGV> (which in turn is short for readline(ARGV)), where ARGV is a special file handle that reads from the files listed in @ARGV (or STDIN if @ARGV is empty). As it opens each file in @ARGV, it removes the name from @ARGV and stores it in $ARGV.

unix functions inside perl

I tried to use some unix tools inside a perl driver script because I knew little about writing shell script. My purpose is to just combine a few simple unix commands together so I can run the script on 100 directories in one perl command.
The task is I have more than 100 folders, in each folder, there are n number of files. I want to do the same thing on each folder, which is to combine the files in them and sort the combined file and use bedtools to merge overlapping regions (quite common practice in bioinformatics)
Here is what I have:
#!/usr/bin/perl -w
use strict;
my $usage ="
This is a driver script to merge files in each folder into one combined file
";
die $usage unless @ARGV;
my ($in)=@ARGV;
open (IN,$in)|| die "cannot open $in";
my %hash;
my $final;
while(<IN>){
chomp;
my $tf = $_;
my @array =`ls $tf'/.'`;
my $tmp;
my $tmp2;
foreach my $i (@array){
$tmp = `cut -f 1-3 $tf'/'$i`;
$tmp2 = `cat $tmp`;
}
my $tmp3;
$tmp3=`sort -k1,1 -k2,2n $tmp2`;
$final = `bedtools merge -i $tmp3`;
}
print $final,"\n";
I know that this line isn't working at all..
$tmp2 = `cat $tmp`;
The issue is how to direct the output into another variable in perl and use that variable later on in another unix command...
Please let me know if you can point out where I can change to make it work. Greatly appreciated.
The output from backticks usually includes newlines, which usually have to be removed before using the output downstream. Add some chomps to your code:
chomp( my @array =`ls $tf'/.'` );
my $tmp;
my $tmp2;
foreach my $i (@array){
    chomp( $tmp = `cut -f 1-3 $tf'/'$i` );
    chomp( $tmp2 = `cat $tmp` );
}
my $tmp3;
chomp( $tmp3=`sort -k1,1 -k2,2n $tmp2` );
$final = `bedtools merge -i $tmp3`;
Here is an example of using a Perl variable in a shell command:
#!/usr/bin/env perl
my $var = "/etc/passwd";
my $out = qx(file $var);
print "$out\n";
As for the rest, it's very messy. You should take the time to learn Perl rather than mixing coreutils commands with Perl, since Perl itself is a better tool for the whole job.
OK. I gave it up on perl and decided to give it a try using shell script. It worked!!
Thanks for the above answers though!
for dir in */
do
    name=$(basename "$dir")
    cd "$dir" || continue
    for file in *
    do
        cut -f 1-3 "$file" > "$file.tmp"
    done
    for x in *.tmp
    do
        cat "$x" >> "$name.tmp1"
    done
    sort -k1,1 -k2,2n "$name.tmp1" > "$name.tmp2"
    bedtools merge -i "$name.tmp2" > "$name.combined"
    cd ..   # return to the parent so the next relative cd works
done
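A possible variation on that loop, sketched under some assumptions: the cut and sort steps can be chained with a pipe, which avoids the intermediate .tmp files and the cd. The sample directory, the sample file contents, and the .sorted output name are invented for illustration, and the bedtools step is left as a comment because it depends on bedtools being installed.

```shell
# Hypothetical sample data: one folder holding two small tab-separated files.
workdir=$(mktemp -d)
mkdir "$workdir/sample1"
printf 'chr2\t5\t9\textra\nchr1\t10\t20\textra\n' > "$workdir/sample1/a.bed"
printf 'chr1\t2\t8\textra\n' > "$workdir/sample1/b.bed"

for dir in "$workdir"/*/
do
    name=$(basename "$dir")
    # Keep columns 1-3 of every file in the folder, then sort by name and start.
    cut -f 1-3 "$dir"* | sort -k1,1 -k2,2n > "$workdir/$name.sorted"
    # The merge step from the loop above would follow here, e.g.:
    # bedtools merge -i "$workdir/$name.sorted" > "$workdir/$name.combined"
done

sorted=$(cat "$workdir/sample1.sorted")
echo "$sorted"
rm -rf "$workdir"
```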

Unix/Perl - Remove contents of a file before a pattern

I have a file like this
### SECTION 1 ###
data data
data data
### SECTION 2 ###
data data
data data
Now I want everything before SECTION 2 to be removed.
How can I do this in Perl or Unix?
To edit the file in-place:
perl -i -ne 'print if /SECTION 2/..0' file
perl -ne '$m = 1 if $_ =~ /SECTION 2/ ; next unless $m ; print $_;' filename > newfilename
$ perl -pi -e '$_ = "" unless /SECTION 2/ .. /(*FAIL)/' file